In [None]:
# - Book    https://wesmckinney.com/book/python-builtin#set
# - Cheatsheet https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
# - DataSet     https://www.kaggle.com/account/login?titleType=dataset-downloads&showDatasetDownloadSkip=False&messageId=datasetsWelcome


What is pandas? 
Pandas is a powerful, high-level data manipulation and analysis library for Python, built on the NumPy package.
It provides labeled data structures and analysis tools designed to make data cleaning and analysis quick and easy.

Core components: 
Series: One-dimensional labeled arrays.
DataFrame: Two-dimensional labeled data structures, much like a table in a database, an Excel spreadsheet, or a data frame in R.
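
As a quick illustration of the two core components, here is a minimal sketch. The names `prices` and `fruits` and all the values are made up for the example.

```python
import pandas as pd

# A Series: one-dimensional values, each with a label (made-up data)
prices = pd.Series([1.2, 0.5, 3.0], index=["apple", "banana", "cherry"])
print(prices["banana"])        # look up a value by its label

# A DataFrame: a table of columns that share one row index
fruits = pd.DataFrame(
    {"price": [1.2, 0.5, 3.0], "stock": [10, 25, 0]},
    index=["apple", "banana", "cherry"],
)
print(fruits.shape)            # (number of rows, number of columns)
print(type(fruits["price"]))   # a single column is itself a Series
```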

In [None]:
## Open Conda and check the available environments

# (base) C:/users> conda env list

## To remove a conda environment:
# > conda remove -n environment_name --all  # or try
# > conda env remove -n env_name

## To create new environment
# conda create --name myenv python=3.7   # create myenv environment with Python version 3.7
# conda create -n python_eda python      # create environment and install python

## Activate an existing environment
#conda activate myenv

## Deactivate current active environment
#conda deactivate

## To List all conda environments
# conda env list

## Another way to list all conda environments
# conda info --envs


## Install packages in specific environment
#conda install -n myenv numpy pandas matplotlib seaborn scikit-learn



In [None]:
# (base) > conda env list
# (base) > conda create -n python_eda python=3.11

## once the environment is created, enter into the environment
# (base) > conda activate python_eda
# (python_eda) > conda list    # it will show all installed packages

## To install kernel
# (python_eda) > pip install ipykernel

## To install Pandas library
# (python_eda) > pip install pandas

## To install openpyxl library (needed to read and write .xlsx files)
# (python_eda) > pip install openpyxl



#### OPEN new file with Python Notebook extension ipynb, like pandas.ipynb

#import pandas as pd

## Read the data
#df = pd.read_csv('./data/dataset.csv')

## Write the data. We will need to install the openpyxl library
# df.to_excel('datafile.xlsx')

## To Read Excel File
# df_xl = pd.read_excel('datafile.xlsx')

## print the variable or dataset in Terminal 
# print(df)

## What the data looks like (columns, dtypes, non-null counts)
# df.info()

## To print data types
# df.dtypes

## Print top 5 rows (the default); pass a number for more, e.g. df.head(10)
# df.head()

## Print last 5 rows
# df.tail(5)

## To describe all the data in short summary
# df.describe()

## Check for missing values
# df.isnull().sum()

## Drop NA / Null value rows
# df=df.dropna()

## check if there is any duplicate row
# df.duplicated().sum()
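
The inspection steps above can be tried without any file on disk. This is a minimal sketch that assumes a small made-up CSV (the `csv_text` string below) in place of `./data/dataset.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for ./data/dataset.csv
csv_text = """name,age,city
Ann,34,Oslo
Bob,,Paris
Ann,34,Oslo
Cid,29,Rome
"""
df = pd.read_csv(io.StringIO(csv_text))

df.info()                         # column names, dtypes, non-null counts
print(df.head())                  # first 5 rows by default
missing = df.isnull().sum()       # number of missing values per column
n_dupes = df.duplicated().sum()   # rows identical to an earlier row
clean = df.dropna().drop_duplicates()
print(missing, n_dupes, len(clean), sep="\n")
```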










Pandas Tips


## Setting up the Environment 
- Ensure you have Python and pip installed in a separate conda environment.
- Install pandas with <code>pip install pandas</code>
- Use Jupyter Notebooks or any Python environment to interactively work with pandas.

## Dive into Basic Operations
- Loading Data: Understand how to read data from various sources like CSV, Excel, SQL databases.
<code> import pandas as pd
data = pd.read_csv('datafile.csv')
</code>

- Viewing Data: 
Use commands like <code> head(), tail(), info() </code> and <code> describe() </code> to get a quick overview of your dataset.

- Indexing & Selecting Data: 
Get to grips with <code> .loc[], .iloc[], </code> and conditional selection.

- Filtering Data: Learn about filtering using boolean indexing.
- Sorting & Ordering: Get familiar with sorting and ordering operations on a dataframe.

- Cleaning Data: Learn how to handle missing values using dropna(), fillna() etc., and how to rename columns.

- Selecting Columns: To select a particular column, use square brackets [] after the dataframe object to access the column by name.

- Selecting Rows & Columns: To select rows and columns together, use loc[] for label-based indexing and iloc[] for position-based indexing. A boolean condition inside [] selects the matching rows.
<code> df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]})
print(df)
#Select column B
print("Column B:")
print(df['B'])
#Select row where A is greater than 2
print("\nRow Where A > 2")
print(df[df['A'] > 2])
</code>


- Selecting Data: Learn about selecting specific rows/columns with the loc[] indexer.
<code> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.loc[0, ['A', 'B']] # Returns a Series with the values of columns A and B for the row labelled 0
df.loc[[0, 1]] # Returns the rows labelled 0 and 1 as a DataFrame
</code>
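
To see how label-based and position-based selection differ, here is a small sketch. The frame and its string index labels are made up so that labels and positions do not coincide.

```python
import pandas as pd

# Made-up frame with string labels so labels and positions differ
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

by_label = df.loc["y", "B"]     # row labelled "y", column "B"
by_position = df.iloc[1, 1]     # second row, second column (same cell)
first_two = df.iloc[:2]         # position slices exclude the end
x_to_y = df.loc["x":"y"]        # label slices include the end label
```

Both slices return the rows "x" and "y", even though one is written as `:2` and the other as `"x":"y"`.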


- Selecting Data: Learn about selecting specific rows/columns using indexing and slicing techniques.
<code>df.iloc[0] # first row of the dataframe (df[0] would look for a column named 0)
df[:3] # first three rows of a dataframe
df['columnname'] # select column by name
df[['col1','col2']] # select multiple columns at once
</code>

- Filtering Data: Explore different ways of filtering rows based on conditions such as greater than, less than, or equal to. The condition is evaluated for every row, and only the rows where it is True are kept.
<code>#keep only the records where column 'A' is greater than value x
df[(df['A'] > x)]
</code>
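
Conditions can also be combined. A minimal sketch with made-up data, showing `&`, `|` and `isin()`:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 8, 3], "B": ["a", "b", "a", "c"]})

# Wrap each condition in parentheses; use & (and), | (or), ~ (not)
both = df[(df["A"] > 2) & (df["B"] == "a")]
either = df[(df["A"] > 6) | (df["B"] == "c")]

# isin() keeps rows whose value appears in a list of allowed values
in_set = df[df["B"].isin(["b", "c"])]
```

The parentheses matter because `&` and `|` bind more tightly than the comparison operators.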

- Sorting Data: Get familiar with sorting methods for both ascending and descending order.
<code> df.sort_values(by='ColumnName',ascending=True)</code>
- Group By: Understand the concept of grouping in pandas: split the rows into groups by the values of a column, then work on each group.
<code>grouped = df.groupby('ColumnName')
for key, group in grouped:
    print(key)
    print(group)</code>

- Group By & Aggregate: Perform group by operations and aggregate functions such as mean(), sum(), max() or min().
<code>df.groupby('ColumnName').mean(numeric_only=True)</code>
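
Several aggregates can be computed in one pass with `agg()`. The sketch below uses a made-up `sales` table; the column names are assumptions for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "S", "N", "S", "N"],
    "amount": [10, 20, 30, 40, 50],
})

# One row per region, one column per aggregate function
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "max"])
print(summary)
```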

- Sorting values in decreasing order:
<code> df.sort_values(by='ColumnName', ascending=False) </code>
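
A short sketch contrasting sorting by values with sorting by the index, using made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["b", "a", "c", "a"], "score": [2, 9, 5, 7]},
    index=[3, 1, 2, 0],
)

desc = df.sort_values(by="score", ascending=False)   # highest score first
# Sort by name A-Z, and within the same name by score high-to-low
multi = df.sort_values(by=["name", "score"], ascending=[True, False])
by_index = df.sort_index()                           # order by row labels
```

None of these calls changes `df` itself; each returns a new, sorted DataFrame.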


## Advanced Topics

### Missing Values
Handling missing values can be tricky, but it’s an important part of working with real-world datasets. Pandas provides several options to handle them; here are some basic strategies:
- How do we handle missing values?
- What are common strategies to deal with them?

#### Dealing With Nulls (NaN's):
- Check if there are NaN's present in your dataset.
<code>df.isnull().sum()</code>
- Drop Rows containing null values - dropna():
<code>df.dropna()</code>
- Fill NaN's with some meaningful value fillna():
<code>df.fillna({'Age':5})</code>
- Forward filling NaN's:
<code>df.ffill()</code>
- Backward filling NaN's:
<code>df.bfill()</code>
- Interpolation method to fill NaN's:
<code>df.interpolate()</code>
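
To compare what each strategy produces, here is a minimal sketch on a made-up Series with two consecutive gaps:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

n_missing = s.isnull().sum()       # 2 missing values
filled_mean = s.fillna(s.mean())   # mean of the non-missing values is 2.5
forward = s.ffill()                # carry the last seen value forward
backward = s.bfill()               # pull the next valid value backward
linear = s.interpolate()           # straight line between known points

print(pd.DataFrame({"ffill": forward, "bfill": backward, "interp": linear}))
```

Which strategy is appropriate depends on the data: forward filling suits ordered readings, while a constant or the mean is a common choice for unordered rows.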

### Duplicate Entries
- Find duplicate entries within the same DataFrame.
<code>duplicates = df[df.duplicated(keep=False)]</code>
- Remove duplicates:
<code>df.drop_duplicates(inplace=True)</code>
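
The `keep` and `subset` arguments control which rows count as duplicates. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Oslo", "Rome", "Rome", "Oslo"],
})

all_copies = df[df.duplicated(keep=False)]   # every row that has a twin
deduped = df.drop_duplicates()               # keeps the first occurrence
# Compare only the 'city' column and keep the last row for each city
one_per_city = df.drop_duplicates(subset="city", keep="last")
```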

### Merging, Joining, Concatenating
- Different types of joins available through the <code>how</code> parameter of merge(): inner join (the default), left join, right join, and outer join (also called a full join).
- Example code:
<code>merged_dataframe = pd.merge(left=df1,right=df2,on="common_column")</code>
- Concat function can be used to combine two or more dataframes along any axis.
<code>pd.concat([df1,df2],axis=0)</code>
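
The join types differ in which keys survive. The sketch below uses two made-up frames that share only some values of `key`:

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
df2 = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = pd.merge(df1, df2, on="key", how="inner")   # keys in both frames
left = pd.merge(df1, df2, on="key", how="left")     # every key from df1
outer = pd.merge(df1, df2, on="key", how="outer")   # keys from either frame

# concat stacks rows; columns missing from one frame are filled with NaN
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
```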



### Data Cleaning 🧹
- Handling Missing Data: Utilize methods like dropna() and fillna(), and understand that they return a new object unless the inplace parameter is set to True.

- Data Type Conversion: Grasp astype() to convert data types and understand pandas’ native data types.

- Removing Duplicates: Employ drop_duplicates() to maintain data integrity.
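
For data type conversion, a minimal sketch with made-up string columns. `pd.to_numeric` is shown next to `astype()` because `astype()` raises an error when a value cannot be converted.

```python
import pandas as pd

df = pd.DataFrame({"qty": ["1", "2", "3"], "price": ["1.5", "n/a", "2.0"]})

# astype() needs every value to be convertible
df["qty"] = df["qty"].astype(int)

# errors="coerce" turns unparseable values such as "n/a" into NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
```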

### Data Manipulation & Analysis 📈
- Aggregation: Use powerful grouping and aggregation tools like groupby(), pivot_table(), and crosstab().

- String Operations: Dive into the .str accessor for essential string operations within Series.

- Merging, Joining, and Concatenating: Understand the differences and applications of merge(), join(), and concat().
- Date Time Operations: Understand how to work with datetime objects using to_datetime() and date_range().
- Missing Values Handling: Learn about handling missing values, such as dropping rows/columns with NaN's and forward filling.

- Reshaping Data: Grasp melt() and pivot() for transforming datasets.
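
Reshaping is easier to follow with a tiny table. In this sketch the `wide` frame and its column names are made up: melt() turns it into long form and pivot() turns it back.

```python
import pandas as pd

wide = pd.DataFrame({
    "city": ["Oslo", "Rome"],
    "2022": [10, 20],
    "2023": [12, 25],
})

# Wide -> long: one row per (city, year) pair
long = wide.melt(id_vars="city", var_name="year", value_name="sales")

# Long -> wide again: one row per city, one column per year
back = long.pivot(index="city", columns="year", values="sales")
print(long)
print(back)
```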

## Pandas Visualization 🐼🐼
- Basic plotting functions: use hist(), boxplot(), plot.scatter(), etc., on a dataframe.
- Advanced visualizations: bar plots, line charts, heatmaps, correlation matrices, etc.
- Customizing Plots: customize colors, labels, sizes, titles, etc.

## Python Libraries Used In Data Science 💻
- Numpy
- Matplotlib
- Seaborn

## Advanced Features 🎩
- Time Series in pandas: Work with date-time data, resampling, and shifting.

- Categorical Data: Understand pandas’ categorical type and its advantages.

- Styling: Style your DataFrame output for better visualization in Jupyter Notebooks.
- Handling Missing Values: Fill missing values with mean/median/mode/etc.

- Data Wrangling: Perform complex transformations such as multiple column conversions, merges, and splits.
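
For the time series item, here is a small sketch of resampling and shifting on a made-up daily Series:

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day_sum = ts.resample("2D").sum()   # group into 2-day bins, sum each bin
prev_day = ts.shift(1)                  # yesterday's value; first one is NaN
daily_change = ts - prev_day            # difference from the previous day
```

Resampling needs a datetime index, which is why the example builds one with `date_range()`.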

## Practice Problems 🏋️‍♀️
Apply these concepts to solve real world problems related to data analysis and manipulation.

## Optimization & Scaling 🚀
- Efficiently using Data Types: Use category type for object columns with few unique values to save memory.

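- Method Chaining: Chain operations into a single expression to make pandas code more readable and avoid intermediate variables.

- Use eval() & query(): Express filters and computations as string expressions; on large DataFrames these can be faster when the optional numexpr library is installed.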
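
A rough sketch of these three ideas together, on made-up data. The exact byte counts depend on the pandas version, so only the comparison between the two numbers matters here.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue"] * 1000,   # few unique values
    "value": range(3000),
})

# Memory used by the column as plain strings vs. as a category
as_strings = df["color"].memory_usage(deep=True)
as_category = df["color"].astype("category").memory_usage(deep=True)
print(as_strings, as_category)

# query() filters with a string expression; chaining keeps the steps in order
result = (
    df.query("value > 2990 and color == 'red'")
      .sort_values("value", ascending=False)
      .reset_index(drop=True)
)
print(result)
```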

## Pandas’ Ecosystem 🌍
- Other Libraries: Explore libraries like Dask for parallel computing and Vaex for handling large datasets.

- Visualization: While pandas itself has visualization capabilities, integrating it with Matplotlib and Seaborn can enhance your data visualization game.

