# <font color="Red"><h3 align="center">Table of Contents</h3></font>

1. Introduction and Installation 
2. DataFrame Basics 
3. Read Write Excel CSV File
4. Different Ways Of Creating DataFrame
5. Handle Missing Data: fillna, dropna, interpolate
6. Handle Missing Data: replace function
7. Concat Dataframes
8. Pivot table
9. Pandas Crosstab 

# <font color="Blue"><h3 align="center">1.Introduction and Installation</h3></font>

In [None]:
from IPython.display import Image
Image(filename='pandas.png')

> [Pandas](https://pandas.pydata.org/pandas-docs/stable/) is usually the first tool a data scientist reaches for. It is built on top of the [NumPy package](https://docs.scipy.org/doc/numpy/reference/), so familiarity with NumPy makes pandas easier to learn. On top of NumPy's arrays, pandas adds labeled rows and columns, missing-data handling, and readers and writers for common file formats.
> 
> Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

In [None]:
!pip install pandas 

# <font color="Green"><h3 align="center">2.DataFrame Basics</h3></font>

> A DataFrame is the most commonly used object in pandas. It is a table-like data structure with rows and columns, similar to an Excel spreadsheet.

In [None]:
import pandas as pd
weather_data = {
    'day': ['1/1/2017','1/2/2017','1/3/2017','1/4/2017','1/5/2017','1/6/2017'],
    'temperature': [32,35,28,24,32,31],
    'windspeed': [6,7,2,7,4,2],
    'event': ['Rain', 'Sunny', 'Snow','Snow','Rain', 'Sunny']
}
df = pd.DataFrame(weather_data)
df

In [None]:
df.shape # (number of rows, number of columns)

## <font color='blue'>Rows</font>

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df[1:3] # rows at positions 1 and 2; the end of the slice is excluded

## <font color='blue'>Columns</font>

In [None]:
df.columns

In [None]:
df['day']

In [None]:
type(df['day'])

In [None]:
df[['day','temperature']]

## <font color='blue'>Operations On DataFrame</font>

In [None]:
df['temperature'].max()

In [None]:
df[df['temperature']>32]

In [None]:
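df['day'][df['temperature'] == df['temperature'].max()] # SQL analogy: SELECT day FROM df WHERE temperature = (SELECT MAX(temperature) FROM df)

In [None]:
df[df['temperature'] == df['temperature'].max()] # SQL analogy: SELECT * FROM df WHERE temperature = (SELECT MAX(temperature) FROM df)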

In [None]:
df['temperature'].std()

In [None]:
df['event'].max() # on strings, max() returns the alphabetically last value; mean() would fail because the column holds strings

In [None]:
df.describe()
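
The filtering cells above all rely on a boolean mask. As a minimal sketch on the same weather data, the mask can be built once and passed to `.loc` together with a column label, and `numeric_only=True` lets `mean()` skip the string columns. The variable names `is_hottest`, `hottest_days`, and `column_means` are illustrative only.

```python
import pandas as pd

# Same weather data as in the cells above
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# The comparison produces a boolean Series, one True/False per row
is_hottest = df['temperature'] == df['temperature'].max()

# .loc takes the row mask and the column label in a single step
hottest_days = df.loc[is_hottest, 'day']

# numeric_only=True skips the string columns instead of failing on them
column_means = df.mean(numeric_only=True)
```

Selecting rows and a column in one `.loc` call avoids the two-step `df['day'][mask]` lookup.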

## <font color='blue'>set_index</font>

In [None]:
df.set_index('day') # returns a new DataFrame; df itself is unchanged

In [None]:
df.set_index('day', inplace=True) # inplace=True modifies df directly

In [None]:
df.index

In [None]:
df.loc['1/2/2017']

In [None]:
df.reset_index(inplace=True)
df.head()

In [None]:
df.set_index('event',inplace=True) # works like a hash map keyed by event, except that keys can repeat
df

In [None]:
df.loc['Snow']
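
Because an index may contain repeated labels, what `.loc` returns depends on how often the label occurs. The following is a small self-contained sketch using a shortened copy of the weather data; `by_event`, `snow_rows`, and `rain_row` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017'],
    'temperature': [32, 35, 28, 24],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow'],
})

# Without inplace=True, set_index returns a new frame and leaves df untouched
by_event = df.set_index('event')

# 'Snow' appears twice in the index, so .loc returns a DataFrame with both rows
snow_rows = by_event.loc['Snow']

# 'Rain' appears once, so .loc returns that single row as a Series
rain_row = by_event.loc['Rain']
```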

# <font color="TEAL"><h3 align="center">3.Read Write Excel CSV File</h3></font>

### <font color="blue">Write to CSV</font>

In [2]:
df.to_csv("new.csv", index=False)


### <font color="blue">Read CSV</font>

In [1]:
df = pd.read_csv("new.csv")
df


In [5]:
# header=None treats the first line of the file as data; names supplies the column labels
df = pd.read_csv("new.csv", header=None, names=["ticker","eps","revenue","people"])
df

In [6]:
df = pd.read_csv("new.csv",  nrows=2)
df

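
The `read_csv` options above can be tried without any file on disk by reading from an in-memory string. This is only a sketch: the CSV text and the replacement column names (`date`, `temp_f`, `weather`) are made up for illustration.

```python
import io
import pandas as pd

csv_text = (
    "day,temperature,event\n"
    "1/1/2017,32,Rain\n"
    "1/2/2017,35,Sunny\n"
    "1/3/2017,28,Snow\n"
)

# Default: the first line is used as the column header
df_default = pd.read_csv(io.StringIO(csv_text))

# nrows=2 reads only the first two data rows
df_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# header=None treats every line as data, so the original header line
# would become row 0 unless it is skipped explicitly
df_renamed = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    skiprows=1,
    names=['date', 'temp_f', 'weather'],  # hypothetical column names
)
```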

### <font color="blue">Write to Excel</font>

In [7]:
df.to_excel("new.xlsx", sheet_name="weather", index=False, startrow=2, startcol=1)


In [8]:
df_stocks = pd.DataFrame({
    'tickers': ['GOOGL', 'WMT', 'MSFT'],
    'price': [845, 65, 64 ],
    'pe': [30.37, 14.26, 30.97],
    'eps': [27.82, 4.61, 2.12]
})

df_weather =  pd.DataFrame({
    'day': ['1/1/2017','1/2/2017','1/3/2017'],
    'temperature': [32,35,28],
    'event': ['Rain', 'Sunny', 'Snow']
})


In [9]:
with pd.ExcelWriter('stocks_weather.xlsx') as writer:
    df_stocks.to_excel(writer, sheet_name="stocks")
    df_weather.to_excel(writer, sheet_name="weather")


### <font color="blue">Read Excel</font>

In [11]:
df = pd.read_excel("new.xlsx", sheet_name="weather")
df

Replace cell values while reading Excel data using converter **functions**

In [12]:
def convert_people_cell(cell):
    if cell=="n.a.":
        return 'Sam Walton'
    return cell

def convert_price_cell(cell):
    if cell=="n.a.":
        return 50
    return cell
    
df = pd.read_excel("new.xlsx", sheet_name="weather", converters={
        'people': convert_people_cell,
        'price': convert_price_cell
    })
df
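
`read_csv` accepts a `converters` argument as well, so the idea can be sketched on in-memory text without an Excel file. The data below is hypothetical, and note one assumption specific to this sketch: with `read_csv` each converter receives the raw cell as a string, so the price converter casts to `float` itself.

```python
import io
import pandas as pd

# Hypothetical data in which missing entries are recorded as "n.a."
csv_text = (
    "tickers,price,people\n"
    "GOOGL,845,larry page\n"
    "WMT,n.a.,n.a.\n"
    "MSFT,64,bill gates\n"
)

def convert_price_cell(cell):
    # Converters are called once per cell, before any type inference
    if cell == "n.a.":
        return 50.0
    return float(cell)

def convert_people_cell(cell):
    if cell == "n.a.":
        return 'Sam Walton'
    return cell

df = pd.read_csv(io.StringIO(csv_text), converters={
    'price': convert_price_cell,
    'people': convert_people_cell,
})
```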

### <font color="blue">Write to JSON</font>

In [13]:
df.to_json('new.json')


### <font color="blue">Read JSON</font>

In [14]:
weather_df = pd.read_json('new.json')
weather_df.head()

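
A JSON round trip can also be checked in memory. In this sketch `to_json` is called without a path, so it returns the JSON text, and `read_json` reads it back from a `StringIO` buffer. The choice of `orient='records'` is an assumption made for the example, not something the cells above require.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017'],
    'temperature': [32, 35, 28],
    'event': ['Rain', 'Sunny', 'Snow'],
})

# With no path argument, to_json returns the JSON text instead of writing a file
json_text = df.to_json(orient='records')

# read_json accepts a file-like object; orient should match how the text was written
restored = pd.read_json(io.StringIO(json_text), orient='records')
```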

# <font color="purple"><h3 align="center">4.Different Ways Of Creating Dataframe</h3></font>

## <font color="green">Using csv</font>

In [None]:
df = pd.read_csv("weather_data.csv")
df

## <font color="green">Using excel</font>

In [None]:
df = pd.read_excel("weather_data.xlsx", sheet_name="Sheet1")
df

## <font color="green">Using dictionary</font>

In [None]:
import pandas as pd
weather_data = {
    'day': ['1/1/2017','1/2/2017','1/3/2017'],
    'temperature': [32,35,28],
    'windspeed': [6,7,2],
    'event': ['Rain', 'Sunny', 'Snow']
}
df = pd.DataFrame(weather_data)
df

## <font color="green">Using tuples list</font>

In [None]:
weather_data = [
    ('1/1/2017',32,6,'Rain'),
    ('1/2/2017',35,7,'Sunny'),
    ('1/3/2017',28,2,'Snow')
]
df = pd.DataFrame(data=weather_data, columns=['day','temperature','windspeed','event'])
df

## <font color="green">Using list of dictionaries</font>

In [None]:
weather_data = [
    {'day': '1/1/2017', 'temperature': 32, 'windspeed': 6, 'event': 'Rain'},
    {'day': '1/2/2017', 'temperature': 35, 'windspeed': 7, 'event': 'Sunny'},
    {'day': '1/3/2017', 'temperature': 28, 'windspeed': 2, 'event': 'Snow'},
    
]
df = pd.DataFrame(data=weather_data, columns=['day','temperature','windspeed','event'])
df
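
One property of the list-of-dictionaries form is that records do not need identical keys. A short sketch with made-up records, where the second one has no `windspeed` entry:

```python
import pandas as pd

# The second record has no 'windspeed' key
records = [
    {'day': '1/1/2017', 'temperature': 32, 'windspeed': 6},
    {'day': '1/2/2017', 'temperature': 35},
]

# Columns are the union of the keys; absent values become NaN
df = pd.DataFrame(records)
```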

## <font color="green">Using JSON</font>

In [None]:
weather_df = pd.read_json('weather_data.json')
weather_df.head()

## <font color="maroon"><h4 align="center">5.Handling Missing Data - fillna, interpolate, dropna</h4></font>

In [15]:
import pandas as pd
df = pd.read_csv("weather_data.csv",parse_dates=['day'])
type(df.day[0])
df


In [16]:
df.set_index('day',inplace=True)
df


## <font color="blue">fillna</font>
<font color="purple">**Fill all NaN with one specific value**</font>

In [17]:
new_df = df.fillna(0)
new_df


<font color="purple">**Fill na using column names and dict**</font>

In [18]:
new_df = df.fillna({
        'temperature': 0,
        'windspeed': 0,
        'event': 'No Event'
    })
new_df


<font color="purple">**Use method to determine how to fill na values**</font>

In [19]:
new_df = df.ffill() # forward fill: propagate the last valid value down each column
new_df

In [20]:
new_df = df.bfill() # backward fill: pull the next valid value up each column
new_df

<font color="purple">**Use of axis**</font>

In [21]:
new_df = df.bfill(axis="columns") # axis is either "index" or "columns"
new_df

<font color="purple">**limit parameter**</font>

In [22]:
new_df = df.ffill(limit=1) # fill at most one consecutive NaN per gap
new_df
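
The difference between forward fill, backward fill, and the `limit` parameter is easiest to see on a tiny Series. This is a sketch with made-up values; it uses the `ffill()` and `bfill()` methods, which perform forward and backward filling directly.

```python
import numpy as np
import pandas as pd

s = pd.Series([32.0, np.nan, np.nan, 28.0, np.nan])

forward = s.ffill()          # carry the last seen value forward
backward = s.bfill()         # pull the next valid value backward
limited = s.ffill(limit=1)   # fill at most one consecutive NaN per gap
```

The trailing NaN survives `bfill()` because no later value exists to copy from, and `limit=1` leaves the second NaN of the two-element gap unfilled.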

### <font color="blue">interpolate</font>

In [23]:
new_df = df.interpolate()
new_df

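
By default `interpolate()` is linear and treats rows as equally spaced. When the index holds dates with uneven gaps, `method='time'` may give a more sensible estimate. A minimal sketch with made-up temperatures:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-05'])
temps = pd.Series([30.0, np.nan, 36.0], index=idx)

# Default 'linear' ignores the index and treats the rows as equally spaced
linear = temps.interpolate()

# method='time' weights the estimate by the actual gap between timestamps
by_time = temps.interpolate(method='time')
```

Here the linear estimate for January 2 is the midpoint, 33.0, while the time-based estimate is 31.5 because January 2 lies one day into a four-day gap.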

### <font color="blue">dropna</font>

In [24]:
new_df = df.dropna()
new_df


In [25]:
new_df = df.dropna(how='all')
new_df

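
The effect of `how` is easier to compare side by side. The sketch below uses a made-up frame and also shows `thresh`, a further `dropna` parameter not used in the cells above, which keeps rows that have at least the given number of non-NaN values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'temperature': [32.0, np.nan, np.nan, 28.0],
    'windspeed': [6.0, 7.0, np.nan, np.nan],
    'event': ['Rain', np.nan, np.nan, 'Snow'],
})

any_missing = df.dropna()            # drop rows with at least one NaN
all_missing = df.dropna(how='all')   # drop rows only if every value is NaN
at_least_two = df.dropna(thresh=2)   # keep rows with 2 or more non-NaN values
```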

### <font color="blue">Inserting Missing Dates</font>

In [26]:
dt = pd.date_range("01-01-2017","01-11-2017")
idx = pd.DatetimeIndex(dt)
df = df.reindex(idx)
df

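
The same reindexing step can be sketched on a small frame with a known gap. The dates and temperatures below are made up; `pd.date_range` already returns a `DatetimeIndex`, so it can be passed to `reindex` directly.

```python
import pandas as pd

df = pd.DataFrame(
    {'temperature': [32, 35, 28]},
    index=pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-05']),
)

# date_range returns a DatetimeIndex with daily frequency by default
full_range = pd.date_range('2017-01-01', '2017-01-05')

# Dates absent from the original index appear as rows of NaN
filled_in = df.reindex(full_range)
```

The new rows for January 3 and 4 hold NaN and can then be handled with any of the filling techniques from this section.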

## <font color="NAVY"><h4 align="center">6.Handling Missing Data - replace method</h4></font>

**Replacing single value**

In [27]:
import numpy as np
new_df = df.replace(-99999, value=np.nan)
new_df

**Replacing per column**

In [28]:
new_df = df.replace({
        'temperature': -99999,
        'windspeed': -99999,
        'event': '0'
    }, np.nan)
new_df


**Replacing by using mapping**

In [29]:
new_df = df.replace({
        -99999: np.nan,
        'no event': 'Sunny',
    })
new_df

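
The per-column and mapping forms behave differently, which a self-contained sketch can make visible. The readings below are hypothetical, with `-99999` standing in for a missing measurement.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -99999 marks a missing measurement
df = pd.DataFrame({
    'temperature': [32, -99999, 28],
    'windspeed': [6, 7, -99999],
    'event': ['Rain', '0', 'no event'],
})

# Per-column sentinels: '0' is treated as missing only in the event column
per_column = df.replace(
    {'temperature': -99999, 'windspeed': -99999, 'event': '0'}, np.nan
)

# Mapping form: each key is replaced by its own value, in every column
mapped = df.replace({-99999: np.nan, 'no event': 'Sunny'})
```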

**Replacing list with another list**

In [30]:
df = pd.DataFrame({
    'score': ['exceptional','average', 'good', 'poor', 'average', 'exceptional'],
    'student': ['rob', 'maya', 'parthiv', 'tom', 'julian', 'erica']
})
df

| | score | student |
|---|---|---|
| 0 | exceptional | rob |
| 1 | average | maya |
| 2 | good | parthiv |
| 3 | poor | tom |
| 4 | average | julian |
| 5 | exceptional | erica |


In [31]:
df.replace(['poor', 'average', 'good', 'exceptional'], [1,2,3,4])

| | score | student |
|---|---|---|
| 0 | 4 | rob |
| 1 | 2 | maya |
| 2 | 3 | parthiv |
| 3 | 1 | tom |
| 4 | 2 | julian |
| 5 | 4 | erica |
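
When only one column holds the labels, an explicit dictionary with `Series.map` is a possible alternative to the list-to-list form. This is a different technique from the `replace` call above, shown here as a sketch; the `rank` dictionary and the `score_rank` column are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'score': ['exceptional', 'average', 'good', 'poor'],
    'student': ['rob', 'maya', 'parthiv', 'tom'],
})

# Explicit label-to-rank mapping, applied to the score column only
rank = {'poor': 1, 'average': 2, 'good': 3, 'exceptional': 4}
df['score_rank'] = df['score'].map(rank)
```

Unlike `replace`, `map` turns any label missing from the dictionary into NaN, which makes unexpected values easy to spot.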


# <font color="purple"><h3 align="center">7.Pandas Concatenate</h3></font>

## <font color='blue'>Basic Concatenation</font>

In [32]:
import pandas as pd

india_weather = pd.DataFrame({
    "city": ["mumbai","delhi","banglore"],
    "temperature": [32,45,30],
    "humidity": [80, 60, 78]
})
india_weather

| | city | temperature | humidity |
|---|---|---|---|
| 0 | mumbai | 32 | 80 |
| 1 | delhi | 45 | 60 |
| 2 | banglore | 30 | 78 |


In [33]:
us_weather = pd.DataFrame({
    "city": ["new york","chicago","orlando"],
    "temperature": [21,14,35],
    "humidity": [68, 65, 75]
})
us_weather

| | city | temperature | humidity |
|---|---|---|---|
| 0 | new york | 21 | 68 |
| 1 | chicago | 14 | 65 |
| 2 | orlando | 35 | 75 |


In [34]:
df = pd.concat([india_weather, us_weather])
df

| | city | temperature | humidity |
|---|---|---|---|
| 0 | mumbai | 32 | 80 |
| 1 | delhi | 45 | 60 |
| 2 | banglore | 30 | 78 |
| 0 | new york | 21 | 68 |
| 1 | chicago | 14 | 65 |
| 2 | orlando | 35 | 75 |


## <font color='blue'>Ignore Index</font>

In [35]:
df = pd.concat([india_weather, us_weather], ignore_index=True)
df

| | city | temperature | humidity |
|---|---|---|---|
| 0 | mumbai | 32 | 80 |
| 1 | delhi | 45 | 60 |
| 2 | banglore | 30 | 78 |
| 3 | new york | 21 | 68 |
| 4 | chicago | 14 | 65 |
| 5 | orlando | 35 | 75 |


## <font color='blue'>Concatenation And Keys</font>

In [36]:
df = pd.concat([india_weather, us_weather], keys=["india", "us"])
df

| | | city | temperature | humidity |
|---|---|---|---|---|
| india | 0 | mumbai | 32 | 80 |
| india | 1 | delhi | 45 | 60 |
| india | 2 | banglore | 30 | 78 |
| us | 0 | new york | 21 | 68 |
| us | 1 | chicago | 14 | 65 |
| us | 2 | orlando | 35 | 75 |


In [37]:
df.loc["us"]

| | city | temperature | humidity |
|---|---|---|---|
| 0 | new york | 21 | 68 |
| 1 | chicago | 14 | 65 |
| 2 | orlando | 35 | 75 |


In [38]:
df.loc["india"]

| | city | temperature | humidity |
|---|---|---|---|
| 0 | mumbai | 32 | 80 |
| 1 | delhi | 45 | 60 |
| 2 | banglore | 30 | 78 |


## <font color='blue'>Concatenation Using Index</font>

In [39]:
temperature_df = pd.DataFrame({
    "city": ["mumbai","delhi","banglore"],
    "temperature": [32,45,30],
}, index=[0,1,2])
temperature_df

| | city | temperature |
|---|---|---|
| 0 | mumbai | 32 |
| 1 | delhi | 45 |
| 2 | banglore | 30 |


In [40]:
windspeed_df = pd.DataFrame({
    "city": ["delhi","mumbai"],
    "windspeed": [7,12],
}, index=[1,0])
windspeed_df

| | city | windspeed |
|---|---|---|
| 1 | delhi | 7 |
| 0 | mumbai | 12 |


In [41]:
df = pd.concat([temperature_df,windspeed_df],axis=1)
df

| | city | temperature | city | windspeed |
|---|---|---|---|---|
| 0 | mumbai | 32 | mumbai | 12.0 |
| 1 | delhi | 45 | delhi | 7.0 |
| 2 | banglore | 30 | NaN | NaN |
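
Column-wise concatenation aligns rows on the index, so it only matched the cities above because the integer indexes were chosen to line up. One way to make the alignment explicit, sketched here as an alternative rather than as the notebook's approach, is to set `city` as the index first. `combined` and `both_only` are illustrative names.

```python
import pandas as pd

temperature_df = pd.DataFrame({
    "city": ["mumbai", "delhi", "banglore"],
    "temperature": [32, 45, 30],
})
windspeed_df = pd.DataFrame({
    "city": ["delhi", "mumbai"],
    "windspeed": [7, 12],
})

# With city as the index, axis=1 concatenation aligns rows by city name
# and no duplicate city column is produced
combined = pd.concat(
    [temperature_df.set_index("city"), windspeed_df.set_index("city")],
    axis=1,
)

# join="inner" keeps only the cities present in both frames
both_only = pd.concat(
    [temperature_df.set_index("city"), windspeed_df.set_index("city")],
    axis=1,
    join="inner",
)
```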


## <font color='blue'>Concatenate dataframe with series</font>

In [42]:
s = pd.Series(["Humid","Dry","Rain"], name="event")
s

0    Humid
1      Dry
2     Rain
Name: event, dtype: object

In [43]:
df = pd.concat([temperature_df,s],axis=1)
df

| | city | temperature | event |
|---|---|---|---|
| 0 | mumbai | 32 | Humid |
| 1 | delhi | 45 | Dry |
| 2 | banglore | 30 | Rain |


# <font color="OLIVE"><h3 align="center">8.Pandas Pivot table</h3></font>

<h1 style="color:blue">Pivot basics</h1>

In [49]:
import pandas as pd
import numpy as np
df = pd.read_csv("weather1.csv")
df


In [45]:
df.pivot(index='city',columns='date')


In [46]:
df.pivot(index='city',columns='date',values="humidity")


In [47]:
df.pivot(index='date',columns='city')


In [48]:
df.pivot(index='humidity',columns='city')

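
Since `weather1.csv` is not included here, the reshaping that `pivot` performs can be sketched on a small inline frame instead. The readings are hypothetical; only the column names `date`, `city`, and `humidity` are taken from the calls above.

```python
import pandas as pd

# Hypothetical long-format readings: one row per (date, city) pair
df = pd.DataFrame({
    'date': ['5/1/2017', '5/1/2017', '5/2/2017', '5/2/2017'],
    'city': ['new york', 'mumbai', 'new york', 'mumbai'],
    'temperature': [65, 75, 66, 78],
    'humidity': [56, 80, 58, 83],
})

# One row per city, one column per date, cells taken from 'humidity'
humidity_wide = df.pivot(index='city', columns='date', values='humidity')
```

`pivot` only rearranges values, so each (index, columns) pair must occur at most once in the input.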

<h1 style="color:blue">Pivot Table</h1>

In [50]:
df.pivot_table(index="city",columns="date")


<h2 style="color:brown">Grouper</h2>

In [51]:
df['date'] = pd.to_datetime(df['date'])


In [52]:
df.pivot_table(index=pd.Grouper(freq='M',key='date'),columns='city')

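
Unlike `pivot`, `pivot_table` aggregates repeated (index, columns) pairs, using the mean by default. The sketch below uses hypothetical data with two readings per city on the same date, and a `Grouper` with `freq='MS'` (month start), chosen for this example as an assumption.

```python
import pandas as pd

# Hypothetical data with two readings per city on the same date
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2017-05-01', '2017-05-01', '2017-05-01', '2017-05-01',
        '2017-06-01', '2017-06-01',
    ]),
    'city': ['new york', 'new york', 'mumbai', 'mumbai', 'new york', 'mumbai'],
    'temperature': [65, 61, 75, 83, 70, 85],
})

# pivot() would raise here because (city, date) pairs repeat;
# pivot_table aggregates the duplicates, using the mean by default
mean_temp = df.pivot_table(index='city', columns='date', values='temperature')

# A Grouper on the datetime column buckets the rows by calendar month
monthly = df.pivot_table(
    index=pd.Grouper(key='date', freq='MS'),  # 'MS' labels each bucket by month start
    columns='city',
    values='temperature',
)
```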

# <font color="PURPLE"><h3 align="center">9.Pandas Crosstab </h3></font>

In [66]:
import pandas as pd
df = pd.read_excel("survey.xls")
df

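| | Name | Nationality | Sex | Age | Handedness |
|---|---|---|---|---|---|
| 0 | Kathy | USA | Female | 23 | Right |
| 1 | Linda | USA | Female | 18 | Right |
| 2 | Peter | USA | Male | 19 | Right |
| 3 | John | USA | Male | 22 | Left |
| 4 | Fatima | Bangadesh | Female | 31 | Left |
| 5 | Kadir | Bangadesh | Male | 25 | Left |
| 6 | Dhaval | India | Male | 35 | Left |
| 7 | Sudhir | India | Male | 31 | Left |
| 8 | Parvir | India | Male | 37 | Right |
| 9 | Yan | China | Female | 52 | Right |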


In [67]:
pd.crosstab(df.Nationality,df.Handedness)

| Handedness | Left | Right |
|---|---|---|
| Nationality | | |
| Bangadesh | 2 | 0 |
| China | 2 | 1 |
| India | 2 | 1 |
| USA | 1 | 3 |


In [68]:
pd.crosstab(df.Sex,df.Handedness)

| Handedness | Left | Right |
|---|---|---|
| Sex | | |
| Female | 2 | 3 |
| Male | 5 | 2 |


<h2 style="color:purple">Margins</h2>

In [69]:
pd.crosstab(df.Sex,df.Handedness, margins=True)

| Handedness | Left | Right | All |
|---|---|---|---|
| Sex | | | |
| Female | 2 | 3 | 5 |
| Male | 5 | 2 | 7 |
| All | 7 | 5 | 12 |


<h2 style="color:purple">Multi Index Column and Rows</h2>

In [71]:
pd.crosstab(df.Sex, [df.Handedness,df.Nationality], margins=True)

| Handedness | Left | Left | Left | Left | Right | Right | Right | All |
|---|---|---|---|---|---|---|---|---|
| Nationality | Bangadesh | China | India | USA | China | India | USA | |
| Sex | | | | | | | | |
| Female | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 5 |
| Male | 1 | 1 | 2 | 1 | 0 | 1 | 1 | 7 |
| All | 2 | 2 | 2 | 1 | 1 | 1 | 3 | 12 |
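
Counts can also be expressed as proportions with the `normalize` parameter of `pd.crosstab`. The sketch below uses a hypothetical six-row survey, not the `survey.xls` file loaded above.

```python
import pandas as pd

# Hypothetical six-row survey, not the survey.xls file used above
df = pd.DataFrame({
    'Sex': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
    'Handedness': ['Right', 'Left', 'Left', 'Left', 'Right', 'Right'],
})

# Plain counts of each (Sex, Handedness) combination
counts = pd.crosstab(df['Sex'], df['Handedness'])

# normalize='index' converts each row to proportions that sum to 1
shares = pd.crosstab(df['Sex'], df['Handedness'], normalize='index')
```

With `normalize='index'` each row reads as "of all respondents of this sex, what share is left- or right-handed".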
