<h1>Pandas</h1>

<li>Integrated data manipulation and analysis capabilities</li>
<li>Integration with data visualization libraries</li>
<li>Integration with machine learning libraries</li>
<li>Built-in time-series capabilities (Pandas was originally designed for financial time series data)</li>
<li>Optimized for speed (vectorized operations built on NumPy arrays)</li>
<li>Built-in support for reading data from multiple sources: CSV, Excel, and HTML tables; remote sources (Yahoo, Google, World Bank, FRED) have moved to the separate pandas-datareader package in current versions</li>
<li>Integrated data cleaning support (messy data, missing data, feature construction)</li>
<li><b>End-to-end support for data manipulation, data visualization, data analysis, and presenting results</b></li>
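As a small illustration of several points in the list above, the following sketch reads CSV text from an in-memory string, fills in a missing value, and constructs a new feature. The data and the column names (name, salary, bonus, total_comp) are made up for this example.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk or a URL
csv_text = """name,salary,bonus
Ann,100000,5000
Bob,80000,
Cy,120000,10000
"""

df = pd.read_csv(io.StringIO(csv_text))

# Missing data: the empty bonus field is read as NaN; treat it as 0 here
df["bonus"] = df["bonus"].fillna(0)

# Feature construction: total compensation from two existing columns
df["total_comp"] = df["salary"] + df["bonus"]

print(df)
```

Whether a missing bonus should really count as 0 depends on the data; fillna(0) is only one possible choice.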

<h2>The <span style="color:blue">apply</span> function</h2>
<li><span style="color:blue">apply</span> calls a function once per row or once per column of a dataframe, depending on the specified axis</li>
<li><b>Example</b>: divide the salaries into "High", "Medium", "Low" groups</li>
<li>The axis argument tells pandas to go row by row (axis=1) or column by column (axis=0, default)</li>
<li>The supplied function must make sense for what it receives: a whole row when axis=1, a whole column when axis=0</li>
<li>The apply function is useful for <i>feature engineering</i></li>

<li><b>axis=1</b>: operates row by row. Each x in the lambda function is the row as a series with column names in the index</li>
<li><b>axis=0</b>: operates column by column. Each x in the lambda function is the column as a series with row names in the index</li>
<li><b>Note</b>: Each column has a single dtype (possibly the generic object dtype). A row can contain values of mixed types</li>
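The difference between the two axis values can be seen on a tiny dataframe. This is a minimal sketch; the frame, its labels, and the variable names col_range and row_sum are invented for illustration.

```python
import pandas as pd

# Small hypothetical frame with numeric columns only
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]},
                  index=["r1", "r2", "r3"])

# axis=0: x is one column at a time, indexed by the row labels
col_range = df.apply(lambda x: x.max() - x.min(), axis=0)

# axis=1: x is one row at a time, indexed by the column names
row_sum = df.apply(lambda x: x["a"] + x["b"], axis=1)

print(col_range)  # one value per column
print(row_sum)    # one value per row
```

With axis=0 the result has one entry per column; with axis=1 it has one entry per row, which is why axis=1 is the usual choice when building a new column.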


In [2]:
import numpy as np
import pandas as pd
emp_id = np.array([100,101,102,103,104,105,106,107,108,109,110,111])
names = np.array(['Bill','Ludovica','Qing','Savitri','Giovanni',"Birgit",
                  "Bercù","Elodie","Gurumul","Kwame","Rosa","João"])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50,0,
                  321000.23,37345.22,121200,59621.33,94123.5,45123.2])
department = np.array(['1','2','1','2','1','1','1','2',"1","2","1","1"])
city = np.array(["New York","Catania","Paris","New York","Sydney","Sydney",
                 "Paris","New York","Sydney","Paris","New York","Paris"])
salary = np.array([455000,722321,95223,135000,132033,700000,832123,
                   78123.11,13243.32,456122.17,912321.22,31123])
# Build the dataframe from a dict of columns so each column keeps its own dtype
# (salary and bonus stay float64, so no astype conversion is needed)
df = pd.DataFrame({"employee": names,
                   "department": department,
                   "city": city,
                   "salary": salary,
                   "bonus": bonus},
                  index=emp_id)

df

,employee,department,city,salary,bonus
100,Bill,1,New York,455000.0,232300.56
101,Ludovica,2,Catania,722321.0,478123.45
102,Qing,1,Paris,95223.0,3891.24
103,Savitri,2,New York,135000.0,98012.36
104,Giovanni,1,Sydney,132033.0,52123.5
105,Birgit,1,Sydney,700000.0,0.0
106,Bercù,1,Paris,832123.0,321000.23
107,Elodie,2,New York,78123.11,37345.22
108,Gurumul,1,Sydney,13243.32,121200.0
109,Kwame,2,Paris,456122.17,59621.33


In [2]:
df['categorical_salary'] = df.apply(
    lambda x: "High" if x.salary > 200000 else "Medium" if x.salary > 100000 else "Low",
    axis=1)
df

,employee,department,city,salary,bonus,categorical_salary
100,Bill,1,New York,455000.0,232300.56,High
101,Ludovica,2,Catania,722321.0,478123.45,High
102,Qing,1,Paris,95223.0,3891.24,Low
103,Savitri,2,New York,135000.0,98012.36,Medium
104,Giovanni,1,Sydney,132033.0,52123.5,Medium
105,Birgit,1,Sydney,700000.0,0.0,High
106,Bercù,1,Paris,832123.0,321000.23,High
107,Elodie,2,New York,78123.11,37345.22,Low
108,Gurumul,1,Sydney,13243.32,121200.0,Low
109,Kwame,2,Paris,456122.17,59621.33,High
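For a simple binning like this one, the same grouping can also be sketched without apply by using pd.cut. This is an alternative technique, not the method used in the cell above; the three salaries are copied from the table and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Three salaries taken from the table above, keyed by employee id
salaries = pd.Series([455000.0, 135000.0, 95223.0], index=[100, 103, 102])

# Right-closed bins: (-inf, 100000], (100000, 200000], (200000, inf)
# These match the lambda: > 200000 is High, > 100000 is Medium, else Low
categorical = pd.cut(
    salaries,
    bins=[-np.inf, 100000, 200000, np.inf],
    labels=["Low", "Medium", "High"],
)

print(categorical)
```

One difference to keep in mind: pd.cut returns an ordered categorical column rather than plain strings, which can be convenient for sorting and grouping.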


<h3>axis = 0</h3>
<li>Operates column by column</li>
<li>Each x is a column as a series with the dataframe index as the index</li>
<li>x[index_value] (or x.index_value when the label is a valid Python name) accesses the column's value at that index label, for example x["A"]</li>
<li><b>Example</b>: standardize the values in all columns</li>

In [4]:
import statistics
df = pd.DataFrame({"index_vals":["A","B","C","D","E","F","G"],
                   "data_col_1":[.4,.7,2.4,3.2,1.3,2.1,1.9],
                   "data_col_2":[4.6,10.2,8.7,9.6,4.6,2.1,11.2]}
                 )
df.set_index(["index_vals"],inplace=True)
df


index_vals,data_col_1,data_col_2
A,0.4,4.6
B,0.7,10.2
C,2.4,8.7
D,3.2,9.6
E,1.3,4.6
F,2.1,2.1
G,1.9,11.2


In [4]:
df.apply(lambda x: (x-statistics.mean(x))/statistics.stdev(x),axis=0)

index_vals,data_col_1,data_col_2
A,-1.338073,-0.772682
B,-1.032643,0.838442
C,0.698125,0.406891
D,1.512604,0.665822
E,-0.421784,-0.772682
F,0.392695,-1.491933
G,0.189075,1.126143
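The same standardization can be sketched with pandas' own mean and std methods instead of the statistics module. Series.std() uses the sample standard deviation (ddof=1) by default, as statistics.stdev does, so the results should agree. The names via_apply and vectorized are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"data_col_1": [.4, .7, 2.4, 3.2, 1.3, 2.1, 1.9],
     "data_col_2": [4.6, 10.2, 8.7, 9.6, 4.6, 2.1, 11.2]},
    index=pd.Index(["A", "B", "C", "D", "E", "F", "G"], name="index_vals"),
)

# Column by column, as in the cell above, but with Series methods
via_apply = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

# Same computation without apply: pandas aligns the column means and
# standard deviations with the matching columns
vectorized = (df - df.mean()) / df.std()

print(via_apply.round(6))
```

After standardization each column has mean 0 and sample standard deviation 1 (up to floating-point error).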


<h2>HTML Tables</h2>
<li>Pandas can read a table in an html page into a dataframe</li>
<li>The <span style="color:blue">read_html</span> function reads an html page, extracts the tables (anything in an html <span style="color:blue">table</span> tag) and returns a list of dataframes where each dataframe corresponds to one table</li>
<li>Note that the function returns a <b><span style="color:blue">list</span></b> of dataframes, even if there is only one table on a page</li>
<li>If <span style="color:blue">th (table header)</span> tags exist, read_html extracts them as dataframe column names</li>
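<li>Alternatively, read_html takes a <span style="color:blue">header</span> argument that specifies which row (or rows) to use as the column headers</li>
<li>An index can be specified using <span style="color:blue">index_col</span>, given as a column position or a list of positions</li>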
<li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html">documentation</a></li>
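The behaviour described above can be tried without a network connection by parsing a small HTML string. This is a minimal sketch: the HTML snippet and its values are made up, and read_html relies on an optional parser package (lxml, or BeautifulSoup with html5lib), so the example falls back to an empty list if none is installed.

```python
import io
import pandas as pd

# Hypothetical HTML snippet with one table; the first row uses th tags
html = """
<table>
  <tr><th>Currency</th><th>Rate</th></tr>
  <tr><td>Euro</td><td>0.93</td></tr>
  <tr><td>Japanese Yen</td><td>148.0</td></tr>
</table>
"""

try:
    # Always returns a list, even though the page has a single table
    tables = pd.read_html(io.StringIO(html))
except ImportError:
    tables = []  # no HTML parser installed; nothing to show

print(len(tables))
if tables:
    # The th cells became the column names; use one column as the index
    rates = tables[0].set_index("Currency")
    print(rates.loc["Japanese Yen", "Rate"])
```

Wrapping the string in io.StringIO makes read_html treat it as file content rather than as a path or URL.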


In [5]:
import pandas as pd
# Fetch the page and parse every table on it into a list of dataframes
df_list = pd.read_html('https://www.x-rates.com/table/?from=USD&amount=1')
print(len(df_list))

2


In [7]:
df_list[0]

,US Dollar,1.00 USD,inv. 1.00 USD
0,Euro,0.935629,1.068799
1,British Pound,0.808806,1.236391
2,Indian Rupee,82.984917,0.01205
3,Australian Dollar,1.545335,0.647109
4,Canadian Dollar,1.343858,0.744126
5,Singapore Dollar,1.363074,0.733636
6,Swiss Franc,0.896939,1.114903
7,Malaysian Ringgit,4.6877,0.213324
8,Japanese Yen,148.00464,0.006757
9,Chinese Yuan Renminbi,7.285194,0.137265


In [8]:
major_df = df_list[0]
major_df.set_index("US Dollar",inplace=True)


major_df

US Dollar,1.00 USD,inv. 1.00 USD
Euro,0.935629,1.068799
British Pound,0.808806,1.236391
Indian Rupee,82.984917,0.01205
Australian Dollar,1.545335,0.647109
Canadian Dollar,1.343858,0.744126
Singapore Dollar,1.363074,0.733636
Swiss Franc,0.896939,1.114903
Malaysian Ringgit,4.6877,0.213324
Japanese Yen,148.00464,0.006757
Chinese Yuan Renminbi,7.285194,0.137265


In [9]:
major_df.loc["Japanese Yen"]

1.00 USD         148.004640
inv. 1.00 USD      0.006757
Name: Japanese Yen, dtype: float64
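Once the currency names form the index, loc can select a whole row, a single cell, or several rows by label. The sketch below rebuilds two rows of the table by hand (values copied from the output above) so that it runs without fetching the page; the variable names row, rate, and both are illustrative.

```python
import pandas as pd

# Two rows copied from the table above
major_df = pd.DataFrame(
    {"1.00 USD": [0.935629, 148.00464],
     "inv. 1.00 USD": [1.068799, 0.006757]},
    index=pd.Index(["Euro", "Japanese Yen"], name="US Dollar"),
)

# One row label: the row comes back as a series
row = major_df.loc["Japanese Yen"]

# Row label and column label: a single scalar value
rate = major_df.loc["Japanese Yen", "1.00 USD"]

# List of row labels and one column label: a series of values
both = major_df.loc[["Euro", "Japanese Yen"], "1.00 USD"]

print(rate)
```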