# [Pandas](https://pandas.pydata.org/)

**Data structures handling major use cases**

### Features of Pandas
- Powerful data structure
- Fast and efficient data wrangling
- Easy data aggregation and transformation
- Tools for reading and writing data
- Intelligent and automated data alignment
- High performance merging and joining of data sets

### Data Structures:
**Series**
- One-dimensional labeled array
- Supports multiple data types

**DataFrame**
- Two-dimensional labeled array
- Supports multiple data types
- Input can be a Series
- Input can be another DataFrame

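**Panel**
- Three-dimensional labeled array
- Supports multiple data types
- Items 🡪 axis 0
- Major axis 🡪 rows
- Minor axis 🡪 columns
- Removed from current pandas releases (deprecated in 0.20, removed in 0.25); a DataFrame with a MultiIndex covers the same use case

**Panel 4D (Experimental)**
- Four-dimensional labeled array
- Supports multiple data types
- Labels 🡪 axis 0
- Items 🡪 axis 1
- Major axis 🡪 rows
- Minor axis 🡪 columns
- Also removed from current pandas releases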

#### Creating Series from a List

In [1]:
#Import libraries

import numpy as np
import pandas as pd

In [2]:
#Pass list as an argument

first_series = pd.Series(list('abcdef'))

In [3]:
print(first_series)

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object


The integer index (0 to 5) shown above is generated automatically because no index was passed.

#### Creating Series from an ndarray

In [4]:
#Create an ndarray of country names

np_country = np.array(['Luxembourg','Norway','Japan','Switzerland','United States',
                       'Qatar','Iceland','Sweden','Singapore','Denmark'])

In [5]:
#Pass ndarray as an argument

s_country = pd.Series(np_country)

In [6]:
print(s_country)

0       Luxembourg
1           Norway
2            Japan
3      Switzerland
4    United States
5            Qatar
6          Iceland
7           Sweden
8        Singapore
9          Denmark
dtype: object


#### Creating Series from dict

In [7]:
#Build a Series of GDP per capita indexed by country name
#Note: the values are passed as a list with an explicit index; passing a dict of country: GDP pairs gives the same result

In [8]:
dict_country_gdp = pd.Series(
    [52056.01781,40258.80862,40034.85063,39578.07441,39170.41371,37958.23146,
     37691.02733,36152.66676,34706.19047,33630.24604,33529.83052,30860.12808],
    index=['Luxembourg','Macao, China','Norway','Japan','Switzerland','Hong Kong, China',
           'United States','Qatar','Iceland','Sweden','Singapore','Denmark'])

In [9]:
print(dict_country_gdp)

Luxembourg          52056.01781
Macao, China        40258.80862
Norway              40034.85063
Japan               39578.07441
Switzerland         39170.41371
Hong Kong, China    37958.23146
United States       37691.02733
Qatar               36152.66676
Iceland             34706.19047
Sweden              33630.24604
Singapore           33529.83052
Denmark             30860.12808
dtype: float64
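
The example above passes a list of values together with an explicit index. As a minimal sketch of building the same kind of Series from an actual dict, the snippet below uses three of the countries; the names `gdp_by_country`, `s_gdp` and `s_subset` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative subset of the GDP figures above, stored as a real dict
gdp_by_country = {
    'Luxembourg': 52056.01781,
    'Norway': 40034.85063,
    'Japan': 39578.07441,
}

# Dict keys become the index labels, dict values become the data
s_gdp = pd.Series(gdp_by_country)

# An explicit index selects and reorders keys;
# labels that are not in the dict ('Iceland' here) become NaN
s_subset = pd.Series(gdp_by_country, index=['Japan', 'Luxembourg', 'Iceland'])

print(s_gdp)
print(s_subset)
```

Because the dict keys supply the labels, no separate `index` argument is needed unless a different order or subset is wanted.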


#### Creating Series from Scalar

In [10]:
#Create a Series from a scalar: the value is repeated for every index label

In [11]:
scalar_series = pd.Series(5,index=['a','b','c','d','e'])

In [12]:
scalar_series

a    5
b    5
c    5
d    5
e    5
dtype: int64

### Accessing Elements in Series
Elements can be accessed by position or by index label, using plain square brackets or the **loc** (label-based) and **iloc** (position-based) indexers.

In [13]:
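# access the first element by position
# (pandas 2.1 and later warn when an integer key is used in [] on a label-indexed Series; prefer .iloc[0])

dict_country_gdp[0]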

52056.01781

In [14]:
# access first 5 countries from the series

dict_country_gdp[0:5]

Luxembourg      52056.01781
Macao, China    40258.80862
Norway          40034.85063
Japan           39578.07441
Switzerland     39170.41371
dtype: float64

In [15]:
# look up a country by its index label

dict_country_gdp.loc['United States']

37691.02733

In [16]:
# look up by position

dict_country_gdp.iloc[0]

52056.01781
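
One difference between the two indexers is easy to miss when slicing: **iloc** excludes the end position, while **loc** includes the end label. A minimal sketch, using four of the countries above (`gdp`, `by_position` and `by_label` are illustrative names):

```python
import pandas as pd

gdp = pd.Series([52056.01781, 40034.85063, 39578.07441, 39170.41371],
                index=['Luxembourg', 'Norway', 'Japan', 'Switzerland'])

# Position-based slice: the end position (2) is excluded
by_position = gdp.iloc[0:2]

# Label-based slice: the end label ('Japan') is included
by_label = gdp.loc['Luxembourg':'Japan']

print(by_position)
print(by_label)
```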

### Vectorizing Operations in Series
Arithmetic between two Series is vectorized: it is applied element by element, and elements are matched by index label rather than by position.

In [17]:
first_vector_series = pd.Series([1,2,3,4],index=['a','b','c','d'])
second_vector_series = pd.Series([10,20,30,40],index=['a','b','c','d'])

In [18]:
first_vector_series + second_vector_series

a    11
b    22
c    33
d    44
dtype: int64
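
The two Series above share the same index order, so label matching and position matching give the same answer. The sketch below, with the second index deliberately shuffled, suggests how the results differ once the orders disagree (`left`, `right`, `total` and `positional_total` are illustrative names):

```python
import pandas as pd

left = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
# Same labels as `left`, but in a different order
right = pd.Series([10, 20, 30, 40], index=['a', 'd', 'b', 'c'])

# Series arithmetic pairs values that share a label
total = left + right

# Working on the raw arrays ignores the labels and pairs values by position
positional_total = left.to_numpy() + right.to_numpy()

print(total)
print(positional_total)
```

Here label `b` holds 2 in `left` and 30 in `right`, so the aligned sum for `b` is 32, whereas the purely positional sum in the second slot is 22.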


### DataFrame

#### Creating DataFrame from a dict of Lists

In [21]:
import pandas as pd

In [22]:
#data for five Olympic Games: host city, year and number of participating countries

olympic_data_list = {'HostCity':['London','Beijing','Athens','Sydney','Atlanta'],
                    'Year':[2012,2008,2004,2000,1996],
                    'No. of Patricipating Countries':[205,204,201,200,197]}

In [23]:
#Pass the dict of lists to the DataFrame constructor

df_olympic_data = pd.DataFrame(olympic_data_list)

In [24]:
df_olympic_data

| | HostCity | Year | No. of Patricipating Countries |
|---|---|---|---|
| 0 | London | 2012 | 205 |
| 1 | Beijing | 2008 | 204 |
| 2 | Athens | 2004 | 201 |
| 3 | Sydney | 2000 | 200 |
| 4 | Atlanta | 1996 | 197 |


#### Viewing DataFrame
A DataFrame can be viewed by selecting a column by name, or its numeric columns can be summarized with the describe method.

In [25]:
#select by city name

df_olympic_data.HostCity

0     London
1    Beijing
2     Athens
3     Sydney
4    Atlanta
Name: HostCity, dtype: object
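
Attribute access such as `df_olympic_data.HostCity` only works when the column name is a valid Python identifier. For names that contain spaces, bracket selection is needed. A minimal sketch with a hypothetical frame `df_games` whose column is named 'Host City':

```python
import pandas as pd

# Hypothetical frame: 'Host City' contains a space, so df_games.Host City is not valid Python
df_games = pd.DataFrame({'Host City': ['London', 'Beijing', 'Athens'],
                         'Year': [2012, 2008, 2004]})

# A single column name in brackets returns a Series
host_cities = df_games['Host City']

# A list of column names returns a DataFrame with the columns in that order
subset = df_games[['Year', 'Host City']]

print(host_cities)
print(subset)
```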

In [26]:
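#use the describe method to summarize the numeric columns
#(note the parentheses: without them pandas returns the bound method instead of calling it)

df_olympic_data.describe()

| | Year | No. of Patricipating Countries |
|---|---|---|
| count | 5.000000 | 5.000000 |
| mean | 2004.000000 | 201.400000 |
| std | 6.324555 | 3.209361 |
| min | 1996.000000 | 197.000000 |
| 25% | 2000.000000 | 200.000000 |
| 50% | 2004.000000 | 201.000000 |
| 75% | 2008.000000 | 204.000000 |
| max | 2012.000000 | 205.000000 |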

#### Creating DataFrame from dict of Series

In [27]:
olympic_series_participation = pd.Series([205,204,201,200,197],index=[2012,2008,2004,2000,1996])
olympic_series_country = pd.Series(['London','Beijing','Athens','Sydney','Atlanta'],index=[2012,2008,2004,2000,1996])

In [28]:
df_olympic_series = pd.DataFrame({'No. of Patricipating Countries':olympic_series_participation,
                                 'Host Cities':olympic_series_country})

In [29]:
df_olympic_series

| | No. of Patricipating Countries | Host Cities |
|---|---|---|
| 2012 | 205 | London |
| 2008 | 204 | Beijing |
| 2004 | 201 | Athens |
| 2000 | 200 | Sydney |
| 1996 | 197 | Atlanta |


#### Creating DataFrame from ndarray

In [30]:
import numpy as np

In [31]:
np_array = np.array([2012,2008,2004,2006])  #Create a ndarray with years
dict_ndarray = {'Year':np_array}  #Create a dict with the ndarray

In [32]:
df_ndarray = pd.DataFrame(dict_ndarray)   #Pass this dict to a new DataFrame

In [33]:
df_ndarray

| | Year |
|---|---|
| 0 | 2012 |
| 1 | 2008 |
| 2 | 2004 |
| 3 | 2006 |


#### Creating DataFrame from DataFrame Object

In [34]:
df_from_df = pd.DataFrame(df_olympic_series)

In [35]:
df_from_df

| | No. of Patricipating Countries | Host Cities |
|---|---|---|
| 2012 | 205 | London |
| 2008 | 204 | Beijing |
| 2004 | 201 | Athens |
| 2000 | 200 | Sydney |
| 1996 | 197 | Atlanta |


***

## Handling Missing Values
Operating on a dataset is difficult when it has missing values or when the indices of its inputs do not match.

In [36]:
import pandas as pd

In [37]:
#declare first series

first_series = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])

In [38]:
#declare second series

second_series = pd.Series([10,20,30,40,50],index=['c','e','f','g','h'])

In [39]:
sum_of_series = first_series+second_series

In [40]:
sum_of_series

a     NaN
b     NaN
c    13.0
d     NaN
e    25.0
f     NaN
g     NaN
h     NaN
dtype: float64

In [41]:
#drop NaN (Not a Number) values from dataset

dropna_s = sum_of_series.dropna()

In [42]:
dropna_s

c    13.0
e    25.0
dtype: float64

In [43]:
#fill NaN values with Zeroes (0)

fillna_s = sum_of_series.fillna(0)

In [44]:
fillna_s

a     0.0
b     0.0
c    13.0
d     0.0
e    25.0
f     0.0
g     0.0
h     0.0
dtype: float64

In [45]:
#treat labels missing from either series as 0 while adding, so no NaN is produced

fill_NaN_with_zeroes_before_sum = first_series.add(second_series,fill_value=0)

In [46]:
fill_NaN_with_zeroes_before_sum

a     1.0
b     2.0
c    13.0
d     4.0
e    25.0
f    30.0
g    40.0
h    50.0
dtype: float64
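
Before choosing between dropping and filling, it often helps to see how many values are missing and where. A minimal sketch using the same two series (`total`, `missing_mask`, `n_missing` and `missing_labels` are illustrative names):

```python
import pandas as pd

first_series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
second_series = pd.Series([10, 20, 30, 40, 50], index=['c', 'e', 'f', 'g', 'h'])

# Labels present in only one series produce NaN in the sum
total = first_series + second_series

# Boolean mask that is True wherever the value is NaN
missing_mask = total.isna()

# Count the missing values and list the labels they sit on
n_missing = int(missing_mask.sum())
missing_labels = sorted(total.index[missing_mask])

print(n_missing)
print(missing_labels)
```

Only `c` and `e` appear in both series, so six of the eight labels end up missing.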

## Data Operation
DataFrames provide built-in methods that operate on whole columns or tables at once, which makes data processing faster and more concise.

In [47]:
import pandas as pd

In [48]:
#declare movie rating dataframe: ratings from 1 to 5 (star ratings(*))

df_movie_rating = pd.DataFrame({'Movie 1':[5,4,3,3,2,1],
                               'Movie 2':[4,5,2,3,4,2]},
                               index=['Tom','Jeff','Peter','Ram','Ted','Paul'])

In [49]:
df_movie_rating

| | Movie 1 | Movie 2 |
|---|---|---|
| Tom | 5 | 4 |
| Jeff | 4 | 5 |
| Peter | 3 | 2 |
| Ram | 3 | 3 |
| Ted | 2 | 4 |
| Paul | 1 | 2 |


Custom functions can be applied to every element of a DataFrame with the **applymap** method. In pandas 2.1 and later, **applymap** is deprecated in favor of the equivalent **DataFrame.map**.

In [50]:
#declare a custom function

def movie_grade(rating):
    #map a star rating to a letter grade; any value other than 5, 4 or 3 gets 'F'
    if rating == 5:
        return 'A'
    elif rating == 4:
        return 'B'
    elif rating == 3:
        return 'C'
    else:
        return 'F'

In [51]:
#test the function with a sample rating

print(movie_grade(5))

A


In [52]:
#Apply the function to the DataFrame

df_movie_rating.applymap(movie_grade)

| | Movie 1 | Movie 2 |
|---|---|---|
| Tom | A | B |
| Jeff | B | A |
| Peter | C | F |
| Ram | C | C |
| Ted | F | B |
| Paul | F | F |
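
When the custom function is a simple lookup, a dict applied column by column is one possible alternative that does not depend on **applymap**. The sketch below is an assumption about how that could look, not the notebook's own method; `grade_lookup`, `grade_column` and `df_grades` are illustrative names.

```python
import pandas as pd

df_movie_rating = pd.DataFrame({'Movie 1': [5, 4, 3, 3, 2, 1],
                                'Movie 2': [4, 5, 2, 3, 4, 2]},
                               index=['Tom', 'Jeff', 'Peter', 'Ram', 'Ted', 'Paul'])

# Ratings that have their own grade; everything else falls back to 'F'
grade_lookup = {5: 'A', 4: 'B', 3: 'C'}

def grade_column(column):
    # Series.map leaves NaN for ratings missing from the dict,
    # and fillna turns those into the fallback grade
    return column.map(grade_lookup).fillna('F')

# DataFrame.apply calls grade_column once per column
df_grades = df_movie_rating.apply(grade_column)

print(df_grades)
```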


### Data Operation with Statistical Functions

##### Example

In [53]:
df_test_score = pd.DataFrame({'Test 1':[95,84,73,88,82,61],
                             'Test 2':[74,85,82,73,77,79]},
                            index=['Jack','Lewis','Patrick','Rich','Kelly','Paula'])

In [54]:
#Apply the max function to find the maximum score

df_test_score.max()

Test 1    95
Test 2    85
dtype: int64

In [55]:
#Apply the mean function to find the average score

df_test_score.mean()

Test 1    80.500000
Test 2    78.333333
dtype: float64

In [56]:
#Apply the std function to find the standard deviation for both the tests

df_test_score.std()

Test 1    11.979149
Test 2     4.633213
dtype: float64
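
The calls above reduce each column, giving one value per test. As a minimal sketch of two common variations, `axis=1` reduces across columns to give one value per student, and `ddof=0` switches **std** from the sample formula (the default, `ddof=1`) to the population formula. The names `per_student_mean` and `population_std` are illustrative.

```python
import pandas as pd

df_test_score = pd.DataFrame({'Test 1': [95, 84, 73, 88, 82, 61],
                              'Test 2': [74, 85, 82, 73, 77, 79]},
                             index=['Jack', 'Lewis', 'Patrick', 'Rich', 'Kelly', 'Paula'])

# axis=1 averages across the columns, giving one mean per student
per_student_mean = df_test_score.mean(axis=1)

# ddof=0 divides by n instead of n - 1 (population standard deviation)
population_std = df_test_score.std(ddof=0)

print(per_student_mean)
print(population_std)
```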

### Data Operation Using Groupby

##### Example

In [57]:
#Create a DataFrame with the first and last names of former presidents

df_president_name = pd.DataFrame({'First':['George','Bill','Ronald','Jimmy','George'],
                                 'Last':['Bush','Clinton','Regan','Carter','Washington']})

In [58]:
df_president_name

| | First | Last |
|---|---|---|
| 0 | George | Bush |
| 1 | Bill | Clinton |
| 2 | Ronald | Regan |
| 3 | Jimmy | Carter |
| 4 | George | Washington |


In [59]:
#Group the DataFrame with the first name

grouped = df_president_name.groupby('First')

In [60]:
grp_data = grouped.get_group('George')

In [61]:
grp_data

| | First | Last |
|---|---|---|
| 0 | George | Bush |
| 4 | George | Washington |


In [62]:
#Sort values by first name

df_president_name.sort_values('First')

| | First | Last |
|---|---|---|
| 1 | Bill | Clinton |
| 0 | George | Bush |
| 4 | George | Washington |
| 3 | Jimmy | Carter |
| 2 | Ronald | Regan |
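
Besides pulling out a single group with `get_group`, a grouped object is usually combined with an aggregation that produces one result per group. A minimal sketch on the same kind of data (`counts` and `last_names` are illustrative names):

```python
import pandas as pd

df_president_name = pd.DataFrame({'First': ['George', 'Bill', 'Ronald', 'Jimmy', 'George'],
                                  'Last': ['Bush', 'Clinton', 'Reagan', 'Carter', 'Washington']})

grouped = df_president_name.groupby('First')

# Number of rows that share each first name
counts = grouped.size()

# Collect the last names that belong to each first name into a list
last_names = grouped['Last'].apply(list)

print(counts)
print(last_names)
```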


## Data Standardization

In [63]:
#Create a function that returns standardized values (z-scores)

def standardize_tests(test):
    return (test-test.mean())/test.std()

In [64]:
standardize_tests(df_test_score['Test 1'])

Jack       1.210437
Lewis      0.292174
Patrick   -0.626088
Rich       0.626088
Kelly      0.125218
Paula     -1.627829
Name: Test 1, dtype: float64

In [67]:
#Apply the function to the entire dataset

def standardize_test_score(datafrm):
    return datafrm.apply(standardize_tests)

In [68]:
standardize_test_score(df_test_score)

| | Test 1 | Test 2 |
|---|---|---|
| Jack | 1.210437 | -0.935276 |
| Lewis | 0.292174 | 1.438886 |
| Patrick | -0.626088 | 0.791387 |
| Rich | 0.626088 | -1.151109 |
| Kelly | 0.125218 | -0.287777 |
| Paula | -1.627829 | 0.143889 |
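
A quick sanity check on standardized data is that every column should end up with a mean of about 0 and a standard deviation of about 1. The sketch below applies the same formula to the whole DataFrame at once and checks both properties; `standardized`, `column_means` and `column_stds` are illustrative names.

```python
import pandas as pd

df_test_score = pd.DataFrame({'Test 1': [95, 84, 73, 88, 82, 61],
                              'Test 2': [74, 85, 82, 73, 77, 79]},
                             index=['Jack', 'Lewis', 'Patrick', 'Rich', 'Kelly', 'Paula'])

# Subtracting a Series of column means from a DataFrame aligns on column names,
# so every column is centered and scaled by its own mean and standard deviation
standardized = (df_test_score - df_test_score.mean()) / df_test_score.std()

# After standardization each column should have mean ~0 and std ~1
column_means = standardized.mean()
column_stds = standardized.std()

print(standardized)
print(column_means)
print(column_stds)
```

Floating-point rounding means the means are typically tiny non-zero values rather than exactly 0.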
