# Pandas for Machine Learning

**DataFrame:**

* A pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure.
It lets you store and manipulate labeled data, which makes it well suited to handling datasets.

**Data Cleaning:**

* Pandas provides functions for handling missing data, such as dropna() and fillna().
It allows you to filter or replace specific values in a DataFrame.

**Data Exploration:**

* You can use Pandas for basic statistical analysis, summary statistics, and data exploration.
Functions like describe() and info() provide insights into the dataset.

**Data Preparation:**

* Pandas simplifies data preprocessing tasks such as encoding categorical variables (get_dummies()); scaling and normalizing can be done with ordinary column arithmetic.
It also supports efficient handling of datetime data.
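
As a minimal sketch of these preprocessing steps, the example below one-hot encodes a categorical column with get_dummies() and min-max scales a numeric column using plain column arithmetic. The `prep_demo` frame and its values are made up for illustration and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical data, invented for this example.
prep_demo = pd.DataFrame({
    "Team": ["Finance", "Marketing", "Finance"],
    "Salary": [40000, 60000, 80000],
})

# One-hot encode the categorical column: one indicator column per team.
encoded = pd.get_dummies(prep_demo, columns=["Team"])
print(encoded.columns.tolist())

# Min-max scaling with column arithmetic: the smallest salary maps to 0, the largest to 1.
salary = prep_demo["Salary"]
prep_demo["Salary_scaled"] = (salary - salary.min()) / (salary.max() - salary.min())
print(prep_demo)
```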

In [None]:
# Explain the need of Pandas when we have Numpy
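
One way to see the difference: a NumPy array holds a single dtype, while a DataFrame keeps a dtype per column and labels each column. The sketch below uses made-up records (`records`, `people`) purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical records: a name, an age, and a salary for each person.
records = [("Asha", 31, 52000.0), ("Ben", 45, 61000.0)]

# A plain NumPy array needs one common dtype, so the mixed values all become strings.
arr = np.array(records)
print(arr.dtype.kind)  # 'U' means Unicode strings

# A DataFrame keeps a separate dtype per column and gives each column a label.
people = pd.DataFrame(records, columns=["Name", "Age", "Salary"])
print(people.dtypes)
print(people["Salary"].mean())
```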

## DataFrame

* A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
* A pandas DataFrame consists of three principal components: **the data, rows, and columns.**

In [None]:
# Create dataframe using dict
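
A minimal sketch of building a DataFrame from a dict, where each key becomes a column name and each list becomes that column's values. The names and scores are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column, each list its values.
scores = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Score": [88, 92, 79],
})

print(scores)
print(scores.shape)  # (number of rows, number of columns)
```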

In [None]:
# Create dataframe using dict (keys : list of values), index, cols
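
The same constructor also accepts `index` (row labels) and `columns` (which dict keys to keep, and in what order). The sketch below uses made-up data and labels.

```python
import pandas as pd

# Hypothetical data with three keys; only two are kept below.
raw = {
    "Name": ["Asha", "Ben", "Chen"],
    "Score": [88, 92, 79],
    "City": ["Pune", "Delhi", "Mumbai"],
}

labeled = pd.DataFrame(
    raw,
    index=["s1", "s2", "s3"],    # row labels, one per list element
    columns=["Name", "Score"],   # keep only these keys, in this order
)

print(labeled)
print(labeled.loc["s2", "Score"])
```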

In [None]:
# Reading data from CSV

In [2]:
import pandas as pd

In [4]:
# Forward slashes in the path work on Windows, macOS, and Linux
df = pd.read_csv('data/nba.csv')


In [None]:
# Exploring the dataset -- Importance
# methods -- info, describe,head(2),tail, sample
# attributes - columns, shape
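
The methods and attributes listed above can be tried on any DataFrame. The sketch below uses a small made-up frame (`explore_demo`) rather than the NBA file, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical employee data, invented for this example.
explore_demo = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dana", "Eli"],
    "Team": ["Finance", "Marketing", "Finance", None, "Marketing"],
    "Salary": [52000, 61000, 45000, 70000, 38000],
})

print(explore_demo.shape)       # attribute: (rows, columns)
print(explore_demo.columns)     # attribute: column labels
print(explore_demo.head(2))     # first two rows
print(explore_demo.tail())      # last rows (up to five by default)
print(explore_demo.sample(3, random_state=42))  # three random rows, reproducible
explore_demo.info()             # dtypes and non-null counts per column
print(explore_demo.describe())  # summary statistics for numeric columns
```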

In [None]:
# Feature Selection in Machine Learning
# Single cols/ multiple columns
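
A short sketch of column selection: single brackets with one name return a Series, while a list of names returns a DataFrame. The frame `select_demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for this example.
select_demo = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Salary": [52000, 61000, 45000],
    "Bonus %": [6.5, 4.0, 11.2],
})

# Single column: a string inside the brackets gives a Series.
one_column = select_demo["Salary"]

# Multiple columns: a list inside the brackets gives a DataFrame.
several_columns = select_demo[["Salary", "Bonus %"]]

print(type(one_column).__name__, type(several_columns).__name__)
```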

In [None]:
# Selecting some rows and some columns

# df.loc, df.iloc
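
As a sketch of the difference: loc selects by label and includes the end of a slice, while iloc selects by integer position and excludes it. The frame `index_demo` and its row labels are made up.

```python
import pandas as pd

# Hypothetical data with string row labels.
index_demo = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Chen", "Dana"],
        "Salary": [52000, 61000, 38000, 70000],
        "Team": ["Finance", "Marketing", "Finance", "Marketing"],
    },
    index=["a", "b", "c", "d"],
)

# Label-based: rows "b" through "c" (end included), two named columns.
by_label = index_demo.loc["b":"c", ["Name", "Salary"]]

# Position-based: rows 1 and 2 (end excluded), first two columns.
by_position = index_demo.iloc[1:3, 0:2]

print(by_label)
print(by_label.equals(by_position))
```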

# Data Cleaning -- Cleaning missing values (NaN)
* Clean data supports the accuracy of the machine learning models built on it. Missing values can lead to inaccurate results.

* Many machine learning algorithms cannot handle missing values.

* Cleaned data generally results in better model performance.

* Cleaned data improves overall data quality, making it more reliable and trustworthy for decision-making.

In [6]:
data = pd.read_csv("data/employees.csv")

In [None]:
# data.isnull(), data.isnull().sum()
# check data.sample(10) / data.info()
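
A minimal sketch of counting missing values: isnull() returns a True/False mask of the same shape, and sum() over it counts the True values per column. The frame `missing_demo` is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries.
missing_demo = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Team": ["Finance", "Marketing", np.nan, np.nan],
    "Salary": [52000, 61000, 45000, 70000],
})

# Boolean mask: True wherever a value is missing.
mask = missing_demo.isnull()

# Summing the mask counts missing values in each column.
missing_per_column = mask.sum()
print(missing_per_column)
```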

In [None]:
# Method 1: dropping missing values (this removes data rather than imputing it)
# using dropna() along rows (axis=0) removes every row that contains a NaN
# data_dropna = data.dropna(axis=0)
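
A sketch of what dropna() does on a small made-up frame (`drop_demo`): by default any row with at least one NaN is removed, and the `subset` parameter limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries.
drop_demo = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Team": ["Finance", "Marketing", np.nan, np.nan],
    "Salary": [52000, 61000, 45000, 70000],
})

# Drop every row that has a NaN in any column.
rows_complete = drop_demo.dropna(axis=0)

# Drop rows only when "Gender" is missing; NaN in other columns is kept.
rows_with_gender = drop_demo.dropna(subset=["Gender"])

print(len(drop_demo), len(rows_complete), len(rows_with_gender))
```

Dropping is simple but can discard a lot of data, which is one reason to consider filling values instead.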

In [None]:
# Method 2: imputing
# fillna() fills missing values with a specified value, or with a filling strategy
# such as forward fill or backward fill.
# Filling null values using fillna() --> imputing
# Assign the result back instead of using inplace=True on a selected column,
# which recent pandas versions warn about.
data["Gender"] = data["Gender"].fillna("No Gender")

data.sample(15,random_state=42)

In [None]:
# Filling null values using fillna(); assign the result back to the column
data["Team"] = data["Team"].fillna("No Team")

data.sample(20,random_state=42)
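
The forward-fill and backward-fill strategies mentioned earlier can be sketched on a small made-up Series (`readings`): ffill() copies the last known value forward, and bfill() copies the next known value backward.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with a gap in the middle.
readings = pd.Series([10.0, np.nan, np.nan, 40.0])

# Forward fill: each NaN takes the most recent earlier value.
forward = readings.ffill()

# Backward fill: each NaN takes the next later value.
backward = readings.bfill()

print(forward.tolist())
print(backward.tolist())
```

These strategies usually make sense only when row order is meaningful, for example in time-ordered data.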

In [None]:
# Method 3: replacing NaN with a sentinel value
import numpy as np
# replaces every NaN value in the DataFrame with -99
data_replace = data.replace(to_replace=np.nan, value=-99)

In [None]:
# ----------------Exercise-----------------------------------------

# Exploratory data Analysis

# Data Filtering
We have seen the comparison operators:
* Greater than > 
* Less than < 
* Equal to ==
* Not equal to !=
* Greater than or equal to >=
* Less than or equal to <=


* Boolean operators -- AND (&), OR (|), NOT (~)
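
In pandas, a comparison on a column produces a boolean mask, and masks are combined with `&`, `|`, and `~` rather than the Python keywords `and`, `or`, and `not`. The sketch below uses a made-up frame (`filter_demo`); note the parentheses or named masks around each comparison.

```python
import pandas as pd

# Hypothetical employee data, invented for this example.
filter_demo = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dana"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Team": ["Finance", "Marketing", "Finance", "Marketing"],
    "Salary": [52000, 61000, 38000, 70000],
})

# A comparison gives a boolean mask; indexing with it keeps the True rows.
high_paid = filter_demo[filter_demo["Salary"] > 40000]

# Named masks keep combined conditions readable.
is_male = filter_demo["Gender"] == "Male"
in_finance = filter_demo["Team"] == "Finance"

male_and_finance = filter_demo[is_male & in_finance]  # AND
male_or_finance = filter_demo[is_male | in_finance]   # OR
not_finance = filter_demo[~in_finance]                # NOT

# Summing a mask counts the rows that satisfy the condition.
n_male_finance = int((is_male & in_finance).sum())
print(n_male_finance)
```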

In [None]:
# How to get data of all Males?


In [None]:
# How many people have salary greater than 40000?

In [None]:
# Filter data for salary less than 40000


In [None]:
# I want data of all employees who work for Marketing team


In [None]:
# I want data of all employees who work for any team except Marketing


In [None]:
# How many male employees work for Finance team ? Bool


In [None]:
# How many Senior managers have salary greater than 70000?


In [None]:
# Working with dates

In [8]:
data['Start Date']

0        8/6/1993
1       3/31/1996
2       4/23/1993
3        3/4/2005
4       1/24/1998
          ...    
995    11/23/2014
996     1/31/1984
997     5/20/2013
998     4/20/2013
999     5/15/2012
Name: Start Date, Length: 1000, dtype: object

In [None]:
# dtype is object: the dates are stored as plain strings, not as datetimes

In [9]:
data['Start Date'] = pd.to_datetime(data['Start Date'])

In [10]:
data['Start Date']

0     1993-08-06
1     1996-03-31
2     1993-04-23
3     2005-03-04
4     1998-01-24
         ...    
995   2014-11-23
996   1984-01-31
997   2013-05-20
998   2013-04-20
999   2012-05-15
Name: Start Date, Length: 1000, dtype: datetime64[ns]

In [11]:
# The .dt accessor exposes date parts of a datetime column
data['year'] = data['Start Date'].dt.year
data['month'] = data['Start Date'].dt.month
data['day'] = data['Start Date'].dt.day
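
The conversion and extraction steps can be sketched end to end on a small made-up frame (`date_demo`). The explicit `format` string assumes month/day/year text like the sample output shown earlier; it is an assumption about the data, not something the file guarantees.

```python
import pandas as pd

# Hypothetical data with dates stored as month/day/year strings.
date_demo = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Start Date": ["8/6/1993", "3/4/2005", "11/23/2014"],
})

# Convert strings to datetimes; the format is assumed to be month/day/year.
date_demo["Start Date"] = pd.to_datetime(date_demo["Start Date"], format="%m/%d/%Y")

# Extract date parts with the .dt accessor.
date_demo["year"] = date_demo["Start Date"].dt.year
date_demo["month"] = date_demo["Start Date"].dt.month
date_demo["day"] = date_demo["Start Date"].dt.day

# Date parts can then be used in filters like any other numeric column.
joined_after_2000 = date_demo[date_demo["year"] > 2000]
print(joined_after_2000)
```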

In [None]:
# How many female employees joined after 2000?

# Group by

Grouping summarizes row-level data into one row per group; for example, day-wise data can be summarized into a new DataFrame of month-wise data. This is where the groupby() function is useful. Along with the grouping column, we need to specify how to aggregate the data for each group, for example with mean().

In [None]:
# I want to find the average salary and the average bonus for Male and Female employees

In [None]:
# One row per gender, with the mean of each selected column
data_gender = data.groupby('Gender')[['Salary', 'Bonus %']].mean()
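
The grouping-plus-aggregation pattern can be sketched on a small made-up table (`group_demo`). The second call uses named aggregation to compute different statistics in one pass; the output column names `avg_salary` and `headcount` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this example.
group_demo = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Salary": [50000, 60000, 40000, 70000],
    "Bonus %": [10.0, 5.0, 15.0, 20.0],
})

# Mean of the selected columns for each gender.
means = group_demo.groupby("Gender")[["Salary", "Bonus %"]].mean()
print(means)

# Named aggregation: output_name=(column, function).
summary = group_demo.groupby("Gender").agg(
    avg_salary=("Salary", "mean"),
    headcount=("Salary", "count"),
)
print(summary)
```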

In [None]:
# I want to find the team-wise average salary and average bonus.