<h2 style = "color : Brown"> Operations on Pandas</h2>

This notebook will cover the following topics: 
* Filtering dataframes 
    * Single and multiple conditions
* Creating new columns
* Lambda functions 
* Group by and aggregate functions
* Pivot data
* Merging data frames
    * Joins and concatenations

<h4 style = "color : Sky blue"> Preparatory steps</h4>  

##### Background

An FMCG company P&J found that the sales of their best selling items are affected by the weather and rainfall trend. For example, the sale of tea increases when it rains, sunscreen is sold on the days when it is least likely to rain, and the sky is clear. They would like to check whether the weather patterns play a vital role in the sale of certain items. Hence as initial experimentation, they would like you to forecast the weather trend in the upcoming days. The target region for this activity is Australia; accordingly, this exercise will be based on analysing and cleaning the weather data from the Australian region available on public platforms.  

##### Read the data into a dataframe

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("weatherdata.csv", header =0)

##### Display the data 

In [3]:
data.head(5)

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0


##### Data Dictionary 

1. Date: The date on which the recording was taken
2. Location: The location of the recording
3. MinTemp: Minimum temperature on the day of the recording (in C)
4. MaxTemp: Maximum temperature in the day of the recording (in C)
5. Rainfall: Rainfall in mm
6. Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am
7. Sunshine: The number of hours of bright sunshine in the day.
8. WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight
9. WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight

<h4 style = "color : Sky blue"> Example 1.1: Filtering dataframes</h4>

Find the days which had sunshine for more that 4 hours. These days will have increased sales of sunscreen. 

In [4]:
data.shape

(142193, 9)

In [5]:
data["Sunshine"]>4

0         False
1         False
2         False
3         False
4         False
          ...  
142188    False
142189    False
142190    False
142191    False
142192    False
Name: Sunshine, Length: 142193, dtype: bool

In [6]:
data[data["Sunshine"]>4]

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed
5939,2009-01-01,Cobar,17.9,35.2,0.0,12.0,12.3,SSW,48.0
5940,2009-01-02,Cobar,18.4,28.9,0.0,14.8,13.0,S,37.0
5941,2009-01-03,Cobar,15.5,34.1,0.0,12.6,13.3,SE,30.0
5942,2009-01-04,Cobar,19.4,37.6,0.0,10.8,10.6,NNE,46.0
5943,2009-01-05,Cobar,21.9,38.4,0.0,11.4,12.2,WNW,31.0
...,...,...,...,...,...,...,...,...,...
139108,2017-06-20,Darwin,19.3,33.4,0.0,6.0,11.0,ENE,35.0
139109,2017-06-21,Darwin,21.2,32.6,0.0,7.6,8.6,E,37.0
139110,2017-06-22,Darwin,20.7,32.8,0.0,5.6,11.0,E,33.0
139111,2017-06-23,Darwin,19.5,31.8,0.0,6.2,10.6,ESE,26.0


**Note:** High sunshine corresponds to low rainfall. 

<h4 style = "color : Sky blue"> Example 1.2: Filtering dataframes</h4>

The cold drink sales will most likely increase on the days which have high sunshine(>5) and high max temperature(>35). Use the filter operation to filter out these days

In [7]:
data[(data["MaxTemp"]>35) & (data["Sunshine"]>5)]

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed
5939,2009-01-01,Cobar,17.9,35.2,0.0,12.0,12.3,SSW,48.0
5942,2009-01-04,Cobar,19.4,37.6,0.0,10.8,10.6,NNE,46.0
5943,2009-01-05,Cobar,21.9,38.4,0.0,11.4,12.2,WNW,31.0
5944,2009-01-06,Cobar,24.2,41.0,0.0,11.2,8.4,WNW,35.0
5948,2009-01-10,Cobar,19.0,35.5,0.0,12.0,12.3,ENE,48.0
...,...,...,...,...,...,...,...,...,...
138862,2016-10-17,Darwin,25.1,35.2,0.0,7.4,11.5,NNE,39.0
138879,2016-11-03,Darwin,24.4,35.5,0.0,7.8,9.9,NW,35.0
138892,2016-11-16,Darwin,25.7,35.2,0.0,5.4,11.3,NW,26.0
138905,2016-11-29,Darwin,25.8,35.1,0.8,4.8,6.4,SSE,46.0


**Note:** The construction of the filter condition, it has individual filter conditions separated in parenthesis

<h4 style = "color : Sky blue"> Example 2.1: Creating new columns</h4>
    
If you noticed the filtering done in the earlier examples did not give precise information about the days, the data column simply has the dates. The date column can be split into the year, month and day of the month. 

**Special module of pandas** The "DatetimeIndex" is a particular module which has the capabilities to extract a day, month and year form the date. 

In [8]:
pd.DatetimeIndex(data["Date"])

DatetimeIndex(['2008-12-01', '2008-12-02', '2008-12-03', '2008-12-04',
               '2008-12-05', '2008-12-06', '2008-12-07', '2008-12-08',
               '2008-12-09', '2008-12-10',
               ...
               '2017-06-15', '2017-06-16', '2017-06-17', '2017-06-18',
               '2017-06-19', '2017-06-20', '2017-06-21', '2017-06-22',
               '2017-06-23', '2017-06-24'],
              dtype='datetime64[ns]', name='Date', length=142193, freq=None)

In [9]:
pd.DatetimeIndex(data["Date"]).year

Int64Index([2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008,
            ...
            2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017],
           dtype='int64', name='Date', length=142193)

**Adding New columns** To add a new column in the dataframe just name the column and pass the instructions about the creation of the new column 

In [None]:
data["Year"] = pd.DatetimeIndex(data["Date"]).year

In [10]:
data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0


In [11]:
data["Month"] = pd.DatetimeIndex(data["Date"]).month

In [12]:
data["Dayofmonth"] = pd.DatetimeIndex(data["Date"]).day

In [14]:
data.head(20)

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,12,1,73.22
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,12,2,77.18
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,12,3,78.26
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,12,4,82.4
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,12,5,90.14
5,2008-12-06,Albury,14.6,29.7,0.2,,,WNW,56.0,12,6,85.46
6,2008-12-07,Albury,14.3,25.0,0.0,,,W,50.0,12,7,77.0
7,2008-12-08,Albury,7.7,26.7,0.0,,,W,35.0,12,8,80.06
8,2008-12-09,Albury,9.7,31.9,0.0,,,NNW,80.0,12,9,89.42
9,2008-12-10,Albury,13.1,30.1,1.4,,,W,28.0,12,10,86.18


<h4 style = "color : Sky blue"> Example 2.2: Creating new columns</h4>

The temperature given is in Celcius, convert it in Fahrenheit and store it in a new column for it. 

In [13]:
data["Maxtemp_F"] = data["MaxTemp"] * 9/5 +32 

In [15]:
data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,12,1,73.22
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,12,2,77.18
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,12,3,78.26
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,12,4,82.4
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,12,5,90.14


<h4 style = "color : Sky blue"> Example 3.1: Lambda Functions</h4>

Let's create a new column which highlights the days which have rainfall more than 50 mm as rainy days and the rest are not.

In [16]:
data.Rainfall

0         0.6
1         0.0
2         0.0
3         0.0
4         1.0
         ... 
142188    0.0
142189    0.0
142190    0.0
142191    0.0
142192    0.0
Name: Rainfall, Length: 142193, dtype: float64

In [17]:
data.Rainfall.apply(lambda x: "Rainy" if x > 50  else "Not rainy")

0         Not rainy
1         Not rainy
2         Not rainy
3         Not rainy
4         Not rainy
            ...    
142188    Not rainy
142189    Not rainy
142190    Not rainy
142191    Not rainy
142192    Not rainy
Name: Rainfall, Length: 142193, dtype: object

**Note** 
1. New way of accessing a column in a dataframe by using the dot operator.
2. "apply" function takes in a lambda operator as argument. 

In [18]:
type(data.Rainfall)

pandas.core.series.Series

In [19]:
type(data["Rainfall"])

pandas.core.series.Series

In [20]:
data["is_raining"] = data.Rainfall.apply(lambda x: "Rainy" if x > 50  else "Not rainy")

In [21]:
## Note that the above code is also another way to find this
## data["is_raining"] = data[Rainfall]apply(lambda x: "Rainy" if x > 50  else "Not rainy")

In [22]:
data[data["is_raining"] == "Rainy"]

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining
429,2010-02-05,Albury,19.2,26.1,52.2,,,SE,33.0,2,5,78.98,Rainy
455,2010-03-08,Albury,18.1,25.5,66.0,,,NW,56.0,3,8,77.90,Rainy
690,2010-10-31,Albury,13.8,18.7,50.8,,,NNW,52.0,10,31,65.66,Rainy
704,2010-11-14,Albury,19.2,22.6,52.6,,,N,26.0,11,14,72.68,Rainy
787,2011-02-05,Albury,20.4,23.0,99.2,,,NW,28.0,2,5,73.40,Rainy
...,...,...,...,...,...,...,...,...,...,...,...,...,...
140532,2017-02-03,Katherine,23.4,33.0,62.0,,,NNW,33.0,2,3,91.40,Rainy
140571,2017-03-14,Katherine,23.0,35.0,79.0,31.0,,ESE,22.0,3,14,95.00,Rainy
140578,2017-03-22,Katherine,24.1,34.5,61.4,,,N,31.0,3,22,94.10,Rainy
142013,2016-12-26,Uluru,22.1,27.4,83.8,,,ENE,72.0,12,26,81.32,Rainy



<h4 style = "color : Sky blue"> Example 4.1: Grouping and Aggregate functions</h4>

Find the location which received the most amount of rain in the given data. In this place, certain promotional offers can be put in place to boost sales of tea, umbrella etc.  

In [23]:
data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,12,1,73.22,Not rainy
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,12,2,77.18,Not rainy
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,12,3,78.26,Not rainy
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,12,4,82.4,Not rainy
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,12,5,90.14,Not rainy


In [24]:
data.tail()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining
142188,2017-06-20,Uluru,3.5,21.8,0.0,,,E,31.0,6,20,71.24,Not rainy
142189,2017-06-21,Uluru,2.8,23.4,0.0,,,E,31.0,6,21,74.12,Not rainy
142190,2017-06-22,Uluru,3.6,25.3,0.0,,,NNW,22.0,6,22,77.54,Not rainy
142191,2017-06-23,Uluru,5.4,26.9,0.0,,,N,37.0,6,23,80.42,Not rainy
142192,2017-06-24,Uluru,7.8,27.0,0.0,,,SE,28.0,6,24,80.6,Not rainy


In [28]:
data_bylocation = data.groupby(by = ['Location']).mean(numeric_only=True)
data_bylocation.head()

Unnamed: 0_level_0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Month,Dayofmonth,Maxtemp_F
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Adelaide,12.628368,22.945402,1.572185,5.824924,7.752002,36.530812,6.523948,15.740453,73.301723
Albany,12.948461,20.072587,2.255073,4.207273,6.658765,,6.41313,15.680371,68.130657
Albury,9.520899,22.630963,1.92571,,,32.953016,6.412488,15.745932,72.735734
AliceSprings,13.125182,29.244191,0.869355,9.029929,9.581944,40.533714,6.407456,15.689211,84.639545
BadgerysCreek,11.1369,24.023111,2.207925,,,33.60989,6.326161,15.769467,75.2416


In [26]:
data_bylocation.sort_values('Rainfall', ascending = False).head()

Unnamed: 0_level_0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Month,Dayofmonth,Maxtemp_F
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Cairns,21.199197,29.544344,5.765317,6.211976,7.575995,38.067991,6.363454,15.720214,85.179819
Darwin,23.21053,32.540977,5.094048,6.319089,8.49931,40.582355,6.534461,15.716792,90.573759
CoffsHarbour,14.365774,23.915575,5.054592,3.904267,7.362374,39.232197,6.392482,15.716898,75.048035
GoldCoast,17.34149,25.752971,3.728933,,,42.472539,6.435906,15.717114,78.355347
Wollongong,14.949058,21.47651,3.589127,,,45.695257,6.423734,15.694268,70.657718


<h4 style = "color : Sky blue"> Example 4.2: Grouping and Aggregate functions</h4>

Hot chocolate is the most sold product in the cold months. Find month which is the coldest so that the inventory team can keep the stock of hot chocolate ready well in advance. 

In [29]:
data_bymonth = data.groupby(by = ['Month']).mean(numeric_only=True)
data_bymonth

Unnamed: 0_level_0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Dayofmonth,Maxtemp_F
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,17.520778,29.547362,2.719036,8.773171,9.208942,43.36173,15.986688,85.185252
2,17.500239,28.877704,3.174075,7.651018,8.607494,41.457472,14.643515,83.979867
3,15.904347,26.886744,2.801304,6.237989,7.646279,39.546399,15.995321,80.396138
4,12.831979,23.611845,2.314764,4.547511,7.107208,36.460285,15.492659,74.50132
5,9.618572,20.047202,1.978896,3.244134,6.337496,35.721056,15.991038,68.084964
6,7.815031,17.324778,2.781114,2.518705,5.660379,35.506375,15.257648,63.1846
7,6.951308,16.764242,2.179314,2.699269,6.06979,37.891458,16.001528,62.175636
8,7.465145,18.25893,2.02961,3.616533,7.171661,40.245052,16.022275,64.866074
9,9.460189,20.77251,1.875851,4.917265,7.69877,42.213311,15.518378,69.390517
10,11.531145,23.540695,1.610734,6.379571,8.50008,42.716694,16.026771,74.373252


In [30]:
data_bymonth.sort_values('MinTemp')

Unnamed: 0_level_0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Dayofmonth,Maxtemp_F
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
7,6.951308,16.764242,2.179314,2.699269,6.06979,37.891458,16.001528,62.175636
8,7.465145,18.25893,2.02961,3.616533,7.171661,40.245052,16.022275,64.866074
6,7.815031,17.324778,2.781114,2.518705,5.660379,35.506375,15.257648,63.1846
9,9.460189,20.77251,1.875851,4.917265,7.69877,42.213311,15.518378,69.390517
5,9.618572,20.047202,1.978896,3.244134,6.337496,35.721056,15.991038,68.084964
10,11.531145,23.540695,1.610734,6.379571,8.50008,42.716694,16.026771,74.373252
4,12.831979,23.611845,2.314764,4.547511,7.107208,36.460285,15.492659,74.50132
11,14.299624,26.165571,2.273758,7.465236,8.685394,42.582385,15.498211,79.098028
12,15.771514,27.52639,2.476483,8.046298,8.975372,43.004769,15.969103,81.547503
3,15.904347,26.886744,2.801304,6.237989,7.646279,39.546399,15.995321,80.396138


<h4 style = "color : Sky blue"> Example 4.3: Grouping and Aggregate functions</h4>

Sometimes feeling cold is more than about low temperatures; a windy day can also make you cold. A factor called the chill factor can be used to quantify the cold based on the wind speed and the temperature. The formula for the chill factor is given by 


$ WCI = (10 * \sqrt{v} - v + 10.5) .(33 - T_{m}) $

v is the speed of the wind and $ T_{m} $ is the minimum temperature

Add a column for WCI and find the month with the lowest WCI. 

In [31]:
from math import sqrt
def wci(x):
    velocity = x['WindGustSpeed']
    minTemp = x['MinTemp']
    return ((10 * sqrt(velocity) - velocity + 10.5)*(33-minTemp))

In [32]:
data['WCI'] = data.apply(wci,axis=1)

In [33]:
data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining,WCI
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,12,1,73.22,Not rainy,643.516918
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,12,2,77.18,Not rainy,840.511893
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,12,3,78.26,Not rainy,649.698327
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,12,4,82.4,Not rainy,844.657118
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,12,5,90.14,Not rainy,519.734257


In [35]:
data_bymonth = data.groupby(by = ['Month']).mean(numeric_only=True)
data_bymonth

Unnamed: 0_level_0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Dayofmonth,Maxtemp_F,WCI
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,17.520778,29.547362,2.719036,8.773171,9.208942,43.36173,15.986688,85.185252,504.169996
2,17.500239,28.877704,3.174075,7.651018,8.607494,41.457472,14.643515,83.979867,511.722359
3,15.904347,26.886744,2.801304,6.237989,7.646279,39.546399,15.995321,80.396138,570.372892
4,12.831979,23.611845,2.314764,4.547511,7.107208,36.460285,15.492659,74.50132,680.79184
5,9.618572,20.047202,1.978896,3.244134,6.337496,35.721056,15.991038,68.084964,787.434259
6,7.815031,17.324778,2.781114,2.518705,5.660379,35.506375,15.257648,63.1846,845.755217
7,6.951308,16.764242,2.179314,2.699269,6.06979,37.891458,16.001528,62.175636,863.519699
8,7.465145,18.25893,2.02961,3.616533,7.171661,40.245052,16.022275,64.866074,836.501471
9,9.460189,20.77251,1.875851,4.917265,7.69877,42.213311,15.518378,69.390517,762.816683
10,11.531145,23.540695,1.610734,6.379571,8.50008,42.716694,16.026771,74.373252,697.875616


In [36]:
data_bymonth.sort_values('WCI', ascending = False)

Unnamed: 0_level_0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Dayofmonth,Maxtemp_F,WCI
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
7,6.951308,16.764242,2.179314,2.699269,6.06979,37.891458,16.001528,62.175636,863.519699
6,7.815031,17.324778,2.781114,2.518705,5.660379,35.506375,15.257648,63.1846,845.755217
8,7.465145,18.25893,2.02961,3.616533,7.171661,40.245052,16.022275,64.866074,836.501471
5,9.618572,20.047202,1.978896,3.244134,6.337496,35.721056,15.991038,68.084964,787.434259
9,9.460189,20.77251,1.875851,4.917265,7.69877,42.213311,15.518378,69.390517,762.816683
10,11.531145,23.540695,1.610734,6.379571,8.50008,42.716694,16.026771,74.373252,697.875616
4,12.831979,23.611845,2.314764,4.547511,7.107208,36.460285,15.492659,74.50132,680.79184
11,14.299624,26.165571,2.273758,7.465236,8.685394,42.582385,15.498211,79.098028,612.435126
3,15.904347,26.886744,2.801304,6.237989,7.646279,39.546399,15.995321,80.396138,570.372892
12,15.771514,27.52639,2.476483,8.046298,8.975372,43.004769,15.969103,81.547503,561.241935


In [37]:
#The month with the lowest WCI
data_bymonth.sort_values('WCI')

Unnamed: 0_level_0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Dayofmonth,Maxtemp_F,WCI
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,17.520778,29.547362,2.719036,8.773171,9.208942,43.36173,15.986688,85.185252,504.169996
2,17.500239,28.877704,3.174075,7.651018,8.607494,41.457472,14.643515,83.979867,511.722359
12,15.771514,27.52639,2.476483,8.046298,8.975372,43.004769,15.969103,81.547503,561.241935
3,15.904347,26.886744,2.801304,6.237989,7.646279,39.546399,15.995321,80.396138,570.372892
11,14.299624,26.165571,2.273758,7.465236,8.685394,42.582385,15.498211,79.098028,612.435126
4,12.831979,23.611845,2.314764,4.547511,7.107208,36.460285,15.492659,74.50132,680.79184
10,11.531145,23.540695,1.610734,6.379571,8.50008,42.716694,16.026771,74.373252,697.875616
9,9.460189,20.77251,1.875851,4.917265,7.69877,42.213311,15.518378,69.390517,762.816683
5,9.618572,20.047202,1.978896,3.244134,6.337496,35.721056,15.991038,68.084964,787.434259
8,7.465145,18.25893,2.02961,3.616533,7.171661,40.245052,16.022275,64.866074,836.501471


<h4 style = "color : Sky blue"> Example 5.1: Merging Dataframes</h4>

The join command is used to combine dataframes. Unlike hstack and vstack, the join command works by using a key to combine to dataframes. 

For example the total tea for the Newcastle store for the month of June 2011 is given in the file names ```junesales.csv``` Read in the data from the file and join it to the weather data exracted from the original dataframe. 

In [42]:
sales = pd.read_csv("junesales.csv", header = 0)
sales.head()

Unnamed: 0,Date,Tea_sales(in 100's)
0,6/1/2011,26
1,6/2/2011,35
2,6/3/2011,37
3,6/4/2011,33
4,6/5/2011,25


In [43]:
sales["Dayofmonth"] = pd.DatetimeIndex(sales["Date"]).day
sales.head()

Unnamed: 0,Date,Tea_sales(in 100's),Dayofmonth
0,6/1/2011,26,1
1,6/2/2011,35,2
2,6/3/2011,37,3
3,6/4/2011,33,4
4,6/5/2011,25,5


In [44]:
data["Month"] = pd.DatetimeIndex(data["Date"]).day
data["Year"] = pd.DatetimeIndex(data["Date"]).year
data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining,WCI,Year
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,1,1,73.22,Not rainy,643.516918,2008
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,2,2,77.18,Not rainy,840.511893,2008
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,3,3,78.26,Not rainy,649.698327,2008
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,4,4,82.4,Not rainy,844.657118,2008
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,5,5,90.14,Not rainy,519.734257,2008


In [45]:
# Filter the sales data for the relevant month and the appropriate location to a new dataframe. 

Newcastle_data = data[(data['Location']=='Newcastle') & (data['Year']==2011) & (data['Month']==6)]
Newcastle_data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining,WCI,Year
15494,2011-01-06,Newcastle,18.8,25.6,7.4,,,,,6,6,78.08,Not rainy,,2011
15525,2011-02-06,Newcastle,22.7,38.7,14.8,,,,,6,6,101.66,Not rainy,,2011
15553,2011-03-06,Newcastle,18.6,24.0,0.0,,,,,6,6,75.2,Not rainy,,2011
15579,2011-05-06,Newcastle,12.8,19.8,0.6,,,,,6,6,67.64,Not rainy,,2011
15610,2011-06-06,Newcastle,8.2,19.5,0.0,,,,,6,6,67.1,Not rainy,,2011


In [46]:
merge_data = Newcastle_data.merge(sales, on = "Dayofmonth")
merge_data.head(30)

Unnamed: 0,Date_x,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining,WCI,Year,Date_y,Tea_sales(in 100's)
0,2011-01-06,Newcastle,18.8,25.6,7.4,,,,,6,6,78.08,Not rainy,,2011,6/6/2011,35
1,2011-02-06,Newcastle,22.7,38.7,14.8,,,,,6,6,101.66,Not rainy,,2011,6/6/2011,35
2,2011-03-06,Newcastle,18.6,24.0,0.0,,,,,6,6,75.2,Not rainy,,2011,6/6/2011,35
3,2011-05-06,Newcastle,12.8,19.8,0.6,,,,,6,6,67.64,Not rainy,,2011,6/6/2011,35
4,2011-06-06,Newcastle,8.2,19.5,0.0,,,,,6,6,67.1,Not rainy,,2011,6/6/2011,35
5,2011-07-06,Newcastle,9.4,17.2,0.0,,,,,6,6,62.96,Not rainy,,2011,6/6/2011,35
6,2011-08-06,Newcastle,6.8,23.7,0.0,,,,,6,6,74.66,Not rainy,,2011,6/6/2011,35
7,2011-10-06,Newcastle,12.1,18.3,0.0,,,,,6,6,64.94,Not rainy,,2011,6/6/2011,35
8,2011-11-06,Newcastle,15.2,32.4,0.0,,,,,6,6,90.32,Not rainy,,2011,6/6/2011,35
9,2011-12-06,Newcastle,11.8,20.4,1.6,,,,,6,6,68.72,Not rainy,,2011,6/6/2011,35


<h4 style = "color : Sky blue"> Example 5.2: Merging Dataframes</h4>

##### Types of joins. 

* INNER JOIN
![](1.png)

* LEFT JOIN
![](2.png)

* RIGHT JOIN
![](5.png)

* FULL JOIN
![](4.png)


Each state may have different tax laws, so we might want to add the states information to the data as well.

The file ```locationsandstates.csv``` information about the states and location, the data in this file is **not** same as the weather data. It is possible that few locations in "data" (original dataframe) are not in this file, and all the locations in the file might not be in the original dataframe. 

In the original dataframe add the state data. 

In [47]:
state = pd.read_csv("locationsandstates.csv", header = 0)
state

Unnamed: 0,Location,State
0,Sydney,New South Wales
1,Albury,New South Wales
2,Armidale,New South Wales
3,Bathurst,New South Wales
4,Blue Mountains,New South Wales
...,...,...
71,Joondalup,Western Australia
72,Kalgoorlie,Western Australia
73,Karratha,Western Australia
74,Mandurah,Western Australia


In [48]:
state_data = data.merge(state, on = "Location", how = "left")
state_data

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining,WCI,Year,State
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,1,1,73.22,Not rainy,643.516918,2008,New South Wales
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,2,2,77.18,Not rainy,840.511893,2008,New South Wales
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,3,3,78.26,Not rainy,649.698327,2008,New South Wales
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,4,4,82.40,Not rainy,844.657118,2008,New South Wales
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,5,5,90.14,Not rainy,519.734257,2008,New South Wales
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,2017-06-20,Uluru,3.5,21.8,0.0,,,E,31.0,20,20,71.24,Not rainy,1037.740487,2017,
142189,2017-06-21,Uluru,2.8,23.4,0.0,,,E,31.0,21,21,74.12,Not rainy,1062.364838,2017,
142190,2017-06-22,Uluru,3.6,25.3,0.0,,,NNW,22.0,22,22,77.54,Not rainy,1040.882233,2017,
142191,2017-06-23,Uluru,5.4,26.9,0.0,,,N,37.0,23,23,80.42,Not rainy,947.442458,2017,


<h4 style = "color : Sky blue"> Example 6.1: pivot tables</h4>

Using pivot tables find the average monthly rainfall in the year 2016 of all the locations. The information can then be used to predict the sales of tea in the year 2017.  

In [49]:
data_2016 = data[data["Year"] ==2016]
data_2016

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,Month,Dayofmonth,Maxtemp_F,is_raining,WCI,Year
2474,2016-01-01,Albury,20.4,37.6,0.0,,,ENE,54.0,1,1,99.68,Not rainy,377.807123,2016
2475,2016-01-02,Albury,20.9,33.6,0.4,,,SSE,50.0,2,2,92.48,Not rainy,377.649205,2016
2476,2016-01-03,Albury,18.4,23.1,2.2,,,ENE,48.0,3,3,73.58,Not rainy,464.017672,2016
2477,2016-01-04,Albury,17.3,23.7,15.6,,,SSE,39.0,4,4,74.66,Not rainy,533.014686,2016
2478,2016-01-05,Albury,15.5,22.9,6.8,,,ENE,31.0,5,5,73.22,Not rainy,615.608763,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142014,2016-12-27,Uluru,22.1,35.8,63.8,,,WNW,43.0,27,27,96.44,Rainy,360.510799,2016
142015,2016-12-28,Uluru,22.6,36.8,0.0,,,NW,50.0,28,28,98.24,Not rainy,324.591052,2016
142016,2016-12-29,Uluru,23.2,38.0,0.0,,,SSW,33.0,29,29,100.40,Not rainy,342.467139,2016
142017,2016-12-30,Uluru,19.7,37.0,0.0,,,E,37.0,30,30,98.60,Not rainy,456.557417,2016


In [50]:
data_2016.pivot_table(index = "Location", columns = "Month", values = "Rainfall", aggfunc='mean')

Month,1,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Adelaide,1.65,1.4,4.333333,1.55,3.533333,1.55,2.083333,1.116667,3.15,4.233333,...,1.8,1.816667,1.9,1.583333,2.766667,1.8,6.533333,5.35,3.6,0.828571
Albany,1.666667,0.883333,2.0,1.841667,0.8,0.15,2.383333,2.05,2.875,4.25,...,3.725,3.066667,2.34,2.88,1.163636,3.618182,3.390909,2.758333,1.272727,3.971429
Albury,4.766667,1.916667,4.033333,3.3,2.266667,0.75,0.333333,1.266667,5.05,3.55,...,2.016667,5.85,0.4,1.633333,1.616667,3.033333,0.516667,2.183333,2.636364,5.228571
AliceSprings,0.183333,0.233333,0.333333,0.683333,0.0,0.0,2.15,2.3,2.566667,0.4,...,0.833333,0.0,0.5,0.716667,2.366667,4.383333,0.163636,0.345455,0.42,5.066667
BadgerysCreek,0.236364,1.036364,3.454545,1.163636,9.872727,9.945455,0.290909,0.254545,0.945455,0.545455,...,2.466667,1.75,0.383333,2.95,0.15,0.516667,0.233333,1.966667,5.02,5.628571
Ballarat,3.683333,0.716667,3.5,1.183333,0.8,1.216667,0.716667,2.32,5.5,4.05,...,3.166667,2.25,1.816667,1.266667,1.883333,1.25,0.833333,1.683333,1.454545,2.8
Bendigo,3.45,1.233333,1.933333,2.466667,1.783333,0.1,1.416667,1.4,4.581818,2.133333,...,3.716667,2.666667,1.0,1.05,1.763636,1.436364,0.9,1.6,3.309091,4.028571
Brisbane,1.666667,3.116667,3.327273,8.727273,9.04,1.4,1.85,1.833333,0.916667,1.583333,...,0.8,0.483333,1.166667,0.816667,0.816667,0.6,0.283333,0.233333,0.872727,4.142857
Cairns,0.333333,1.111111,1.044444,12.78,0.927273,1.181818,1.509091,2.666667,2.24,2.22,...,0.94,19.68,8.04,2.155556,12.6,0.781818,14.222222,0.155556,1.05,2.142857
Canberra,1.883333,1.35,5.216667,1.7,4.683333,8.65,0.05,0.066667,3.633333,2.583333,...,5.033333,2.1,0.666667,3.783333,0.566667,0.883333,0.233333,1.15,2.618182,6.228571


Find the Pandas pivot table documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html)

This information can be used to decide the stocks of tea in each of the stores. 

You can modify the pivot_table command to get a lot of work done quickly.

In [51]:
data_2016.pivot_table(index = "Location", columns = "Month", values = "Sunshine", aggfunc='mean')

Month,1,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Albany,7.1,9.166667,4.733333,6.65,7.2,1.833333,6.566667,10.766667,8.75,4.9,...,4.1,4.533333,3.7,1.8,5.233333,10.033333,4.8,7.45,6.25,3.85
AliceSprings,11.7,11.05,9.5,11.366667,12.2,10.7,7.7,10.275,8.875,11.45,...,11.4,10.55,11.575,11.333333,10.7,10.666667,9.2,6.733333,11.95,9.8
Brisbane,7.558333,8.316667,7.009091,8.045455,8.127273,9.183333,8.733333,9.216667,8.05,8.291667,...,9.266667,7.583333,9.458333,9.241667,7.5,7.775,9.383333,8.75,9.027273,8.271429
Cairns,7.875,9.3,7.2,7.7,7.4,10.5,8.6,7.35,6.2,3.8,...,9.1,5.6,7.25,5.425,6.025,7.825,9.1,9.666667,9.65,9.5
Dartmoor,8.0,4.266667,9.566667,10.4,9.225,9.525,5.633333,9.025,7.875,7.075,...,6.3,3.5,2.875,8.1,6.025,5.35,2.3,4.8,6.05,3.9
Darwin,8.85,8.983333,8.25,8.825,9.433333,8.625,9.983333,9.016667,9.791667,9.391667,...,8.125,9.091667,9.241667,9.233333,8.05,8.7,8.025,8.683333,7.945455,8.5
Hobart,6.308333,7.433333,5.783333,7.266667,5.625,6.158333,6.158333,7.15,6.783333,5.75,...,8.025,5.616667,8.175,7.391667,6.408333,6.741667,6.258333,4.25,5.190909,6.533333
Melbourne,5.0875,6.2875,7.4875,5.35,4.225,6.6125,5.2125,6.0375,4.65,5.1125,...,3.775,6.6125,7.225,6.3125,2.7625,5.0625,6.325,6.3,5.322222,9.3
MelbourneAirport,6.116667,6.975,7.566667,5.825,5.216667,6.741667,6.066667,6.683333,5.191667,5.808333,...,4.175,6.283333,6.641667,7.191667,4.216667,5.058333,5.783333,4.958333,4.972727,9.014286
Mildura,7.258333,7.083333,8.83,7.47,8.55,8.418182,8.79,7.67,7.76,7.6,...,7.533333,8.972727,8.55,8.936364,8.54,7.881818,6.872727,6.236364,7.2,8.2


##### Note

[Here](https://pandas.pydata.org/pandas-docs/stable/index.html) is the link to the official documentation of Pandas. Be sure to visit it inorder to explore to availability of functions in the library. 