# Pandas

**Pandas** is a Python library for data analysis. You can do almost anything you can imagine with your data using pandas (as long as you code it out :p). It works like a very powerful database or Excel sheet over which you have full and versatile control, which makes it one of the most useful tools for data analysts and scientists. So we'll take our time understanding how it works, and try our best to reach a certain level of mastery at it.

In this first introduction section, we will go over the two most important units in pandas:

- Series (```pandas.Series```)
- Dataframes (```pandas.DataFrame```)

## Series

A pandas series is very similar to a numpy array. The main difference between the two is that a series can be accessed/indexed using labels (*axis labels*) instead of just a numeric location. (For a more intuitive understanding, think of it as something between a numpy array and a dictionary.) It can also store any Python object, not just numeric data.
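
To make the comparison concrete, here is a minimal sketch of the three points above: position-only access on an array, label and position access on a series, and a series holding non-numeric objects. The values and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# A numpy array is addressed by position only
arr = np.array([10, 20, 30])
print(arr[1])

# A series carries labels alongside its values (illustrative labels)
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])      # lookup by label, like a dictionary key
print(s.iloc[1])   # lookup by position, like an array

# Values do not have to be numeric; mixed Python objects give dtype "object"
mixed = pd.Series([1, "two", 3.0], index=["x", "y", "z"])
print(mixed.dtype)
```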

In [1]:
import pandas as pd
import numpy as np

### Methods to create a series

We can create a pandas series from a list, a numpy array or a dictionary.

In [2]:
labels = ["a","b","c"]
my_list = [10,20,30]
arr = np.array([10,20,30])
my_dict = {"a":10,"b":20,"c":30}

In [3]:
# from just a list/array (i.e., only data is provided, index is created from 0, 1, 2,...)

pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

In [4]:
# from 2 list(s)/array(s) (i.e., both data and index are provided)

pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

In [5]:
# we can omit the parameter names as long as we provide our arguments in the right order (data, index)

pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32

In [6]:
# if we use a dictionary, where the keys are the indices and the values are the data, we need to pass only one argument
pd.Series(my_dict)

a    10
b    20
c    30
dtype: int64
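
As a small extension of the dictionary case, the sketch below passes both a dictionary and an explicit index. The label "z" is a made-up example of a key that is not in the dictionary.

```python
import pandas as pd

my_dict = {"a": 10, "b": 20, "c": 30}

# With a dictionary plus an index, pandas picks the matching keys in the
# order given; a label missing from the dictionary ("z") becomes NaN
s = pd.Series(my_dict, index=["c", "a", "z"])
print(s)
```

Because NaN is a floating point value, the resulting series no longer has an integer dtype.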

#### Using Indices

Much like Python dictionaries, we can access the elements of a series using their index labels.

In [7]:
# note the difference in the indices for later cells

series_1 = pd.Series([1,2,3,4],index = ["USA", "Germany","USSR", "Japan"])  
series_2 = pd.Series([1,2,5,4],index = ["USA", "Germany","Italy", "Japan"])                                     

In [8]:
series_1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [9]:
series_2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [10]:
series_1["USA"], series_2["Germany"]

(1, 2)

> Adding series (+) aligns them on their indices: the result contains every index found in either series, and holds NaN wherever an index is missing from one of them

In [11]:
series_1 + series_2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
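
If the NaN values are not wanted, two common options are sketched below: treat a missing label as 0 with ```Series.add(..., fill_value=0)```, or keep only the shared labels with ```dropna()```. Which one is appropriate depends on what a missing entry means in your data.

```python
import pandas as pd

series_1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "USSR", "Japan"])
series_2 = pd.Series([1, 2, 5, 4], index=["USA", "Germany", "Italy", "Japan"])

# Like series_1 + series_2, but a label missing from one side counts as 0
total = series_1.add(series_2, fill_value=0)
print(total)

# Alternatively, keep only the labels present in both series
shared = (series_1 + series_2).dropna()
print(shared)
```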

## DataFrames

DataFrames are the main workhorse unit of pandas, and series are its building blocks. We can think of a DataFrame as a group of Series objects put together to share the same index, with each Series forming one column.

### Creating a DataFrame

**Note:** While we can store any data inside a dataframe, for simplicity's sake we will stick to numbers during the first part of the course.

In [12]:
# data

# setting a seed to produce reproducible results
np.random.seed(911)

data = np.random.randn(5, 5) # remember what this does?

In [13]:
# columns and indices

columns = ["c0", "c1", "c2", "c3", "c4"]
indices = ["a", "b", "c", "d", "e"]

> ```pd.DataFrame()``` is used to create dataframes, given the data (as lists, arrays or a dictionary) and, optionally, the column and index labels

arguments: **data**, **columns**, **index**

In [14]:
# creating dataframe

df = pd.DataFrame(data=data, columns=columns, index=indices)
df

Unnamed: 0,c0,c1,c2,c3,c4
a,-0.42502,1.17431,-0.60952,-0.433977,2.233601
b,0.436149,-2.468962,-1.080083,0.693728,0.663031
c,1.676887,0.790989,-0.56,1.369961,1.045403
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223
e,0.273916,-1.264592,0.567614,-1.566816,0.838729
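
The "group of Series sharing an index" picture can also be used directly to build a dataframe. The sketch below passes a dictionary of series, where each key becomes a column name. The names and values are made up for illustration.

```python
import pandas as pd

idx = ["a", "b", "c"]
height = pd.Series([1.5, 1.7, 1.6], index=idx)
weight = pd.Series([50, 70, 60], index=idx)

# Each dictionary key becomes a column; the shared index becomes the row labels
frame = pd.DataFrame({"height": height, "weight": weight})
print(frame)

# Pulling a column back out gives a series again
print(type(frame["height"]))
```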


### Opening an already existing file as a DataFrame

We will learn how to open CSV files in this intro section, and will go deeper into other file types in later sections.

>```pd.read_csv()``` is the function provided to load CSV files into your runtime, given that you know the path to the file (either relative to the working directory or absolute)

arguments: **path**

**Note:** use raw strings ```r"..."``` when passing the path to avoid unnecessary errors (for example, backslashes in a Windows path being read as escape sequences)

In [15]:
df_read = pd.read_csv(r"datasets/example")

In [16]:
df_read

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
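
To experiment with ```pd.read_csv()``` without a file on disk, the sketch below reads from an in-memory string through ```io.StringIO```. The CSV contents and the column name "id" are hypothetical. It also shows the ```index_col``` parameter, which tells pandas to use a column as the index instead of generating 0, 1, 2, ...

```python
import io
import pandas as pd

# Hypothetical CSV contents; io.StringIO stands in for a file on disk
csv_text = "id,a,b\nx,0,1\ny,4,5\n"

# Default: every column is data and a fresh 0, 1, ... index is generated
default = pd.read_csv(io.StringIO(csv_text))
print(default)

# index_col=0: the first column becomes the index
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(labelled)
```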


### Working with the data

#### Selection of column(s)

> There are two notations, ```df["<column name>"]``` and ```df.column_name```. The second is not recommended: it cannot be used when the column name contains a space, and when the name clashes with an existing DataFrame attribute or method it returns that attribute rather than the column.

In [17]:
# notice how the column is actually a series
print(f"type of a single column is: {type(df['c1'])}")

df["c1"]

type of a single column is: <class 'pandas.core.series.Series'>


a    1.174310
b   -2.468962
c    0.790989
d   -0.417832
e   -1.264592
Name: c1, dtype: float64
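
A short sketch of where the attribute notation falls over. The column names "unit price" and "count" are chosen for illustration: the first contains a space, the second collides with the ```DataFrame.count``` method.

```python
import pandas as pd

frame = pd.DataFrame({"unit price": [2.5, 4.0], "count": [3, 1]})

# Bracket notation works for any column name
print(frame["unit price"])
print(frame["count"])

# Attribute notation finds the DataFrame.count method, not the column
print(callable(frame.count))
```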

We can select multiple columns at a time by providing a list of column names while indexing, for example ```df[["column1", "column7"]]```

In [18]:
df[["c2", "c0"]]

# selecting more than one column returns a new dataframe containing just those columns

Unnamed: 0,c2,c0
a,-0.60952,-0.42502
b,-1.080083,0.436149
c,-0.56,1.676887
d,0.73337,-2.687477
e,0.567614,0.273916


#### Creating and modifying columns

In [19]:
# works just like creating a new key-value in a dictionary

df["new"] = df["c0"] + df["c4"]

df

Unnamed: 0,c0,c1,c2,c3,c4,new
a,-0.42502,1.17431,-0.60952,-0.433977,2.233601,1.808581
b,0.436149,-2.468962,-1.080083,0.693728,0.663031,1.09918
c,1.676887,0.790989,-0.56,1.369961,1.045403,2.72229
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223,-2.7137
e,0.273916,-1.264592,0.567614,-1.566816,0.838729,1.112645


In [20]:
# similarly we can modify pre-existing columns

df["new"] = 2*df["new"] # remember vectorization?

df

Unnamed: 0,c0,c1,c2,c3,c4,new
a,-0.42502,1.17431,-0.60952,-0.433977,2.233601,3.617162
b,0.436149,-2.468962,-1.080083,0.693728,0.663031,2.19836
c,1.676887,0.790989,-0.56,1.369961,1.045403,5.44458
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223,-5.4274
e,0.273916,-1.264592,0.567614,-1.566816,0.838729,2.22529


#### Selection of row(s)

In [21]:
# based on index

df.loc["a"]

c0    -0.425020
c1     1.174310
c2    -0.609520
c3    -0.433977
c4     2.233601
new    3.617162
Name: a, dtype: float64

In [22]:
# based on numeric position

df.iloc[0]

c0    -0.425020
c1     1.174310
c2    -0.609520
c3    -0.433977
c4     2.233601
new    3.617162
Name: a, dtype: float64
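
In the cells above ```loc``` and ```iloc``` return the same row, so the difference is easy to miss. The sketch below uses made-up data where labels and positions disagree, and also shows that label slices include their end point while position slices do not.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions
s = pd.Series(["x", "y", "z"], index=[2, 0, 1])
print(s.loc[0])    # label 0    -> "y"
print(s.iloc[0])   # position 0 -> "x"

frame = pd.DataFrame({"v": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])
print(frame.loc["a":"c"])   # label slice: rows a, b and c
print(frame.iloc[0:2])      # position slice: rows a and b only
```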

#### Selection of a subset of the DataFrame

In [23]:
# list of rows followed by a list of columns
df.loc[["a", "c"],["c1", "c2", "c4"]]

Unnamed: 0,c1,c2,c4
a,1.17431,-0.60952,2.233601
c,0.790989,-0.56,1.045403


In [24]:
# we can similarly choose a single element this way

df.loc["a", "c0"]

-0.42501987121812485
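
The same subset and single-element selections can be written with positions instead of labels by using ```iloc```. A minimal sketch on a small made-up dataframe:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=["a", "b", "c"],
    columns=["c0", "c1", "c2", "c3"],
)

# Rows a and c, columns c1 and c3, selected by label and by position
by_label = frame.loc[["a", "c"], ["c1", "c3"]]
by_position = frame.iloc[[0, 2], [1, 3]]
print(by_label.equals(by_position))

# A single element, by label and by position
print(frame.loc["a", "c0"], frame.iloc[0, 0])
```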

#### Deleting/dropping columns/rows

>```df.drop()``` is used to drop rows or columns, corresponding to axis 0/1 respectively

In [25]:
# dropping a row

df.drop("a", axis=0)

Unnamed: 0,c0,c1,c2,c3,c4,new
b,0.436149,-2.468962,-1.080083,0.693728,0.663031,2.19836
c,1.676887,0.790989,-0.56,1.369961,1.045403,5.44458
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223,-5.4274
e,0.273916,-1.264592,0.567614,-1.566816,0.838729,2.22529


>**Important:** Dropping does not delete the row/column from the dataframe itself; it returns a new dataframe that doesn't have the corresponding row/column. To change the dataframe we are working on, we need to set the **inplace** parameter to ```True```.

In [26]:
# df is not modified yet, i.e., row 'a' is still there

df

Unnamed: 0,c0,c1,c2,c3,c4,new
a,-0.42502,1.17431,-0.60952,-0.433977,2.233601,3.617162
b,0.436149,-2.468962,-1.080083,0.693728,0.663031,2.19836
c,1.676887,0.790989,-0.56,1.369961,1.045403,5.44458
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223,-5.4274
e,0.273916,-1.264592,0.567614,-1.566816,0.838729,2.22529


In [27]:
# dropping row with inplace

df.drop("a", axis = 0, inplace = True)
df

Unnamed: 0,c0,c1,c2,c3,c4,new
b,0.436149,-2.468962,-1.080083,0.693728,0.663031,2.19836
c,1.676887,0.790989,-0.56,1.369961,1.045403,5.44458
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223,-5.4274
e,0.273916,-1.264592,0.567614,-1.566816,0.838729,2.22529


In [28]:
# similarly dropping columns

df.drop("new", axis = 1, inplace=True)
df

Unnamed: 0,c0,c1,c2,c3,c4
b,0.436149,-2.468962,-1.080083,0.693728,0.663031
c,1.676887,0.790989,-0.56,1.369961,1.045403
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223
e,0.273916,-1.264592,0.567614,-1.566816,0.838729


#### Conditional selection

In [29]:
df

Unnamed: 0,c0,c1,c2,c3,c4
b,0.436149,-2.468962,-1.080083,0.693728,0.663031
c,1.676887,0.790989,-0.56,1.369961,1.045403
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223
e,0.273916,-1.264592,0.567614,-1.566816,0.838729


In [30]:
# creating a boolean df based on a condition we specify

df > 0

Unnamed: 0,c0,c1,c2,c3,c4
b,True,False,False,True,True
c,True,True,False,True,True
d,False,False,True,True,False
e,True,False,True,False,True


In [31]:
# indexing with a boolean df keeps the values where the condition is True and replaces the rest with NaN
df[df > 0]

Unnamed: 0,c0,c1,c2,c3,c4
b,0.436149,,,0.693728,0.663031
c,1.676887,0.790989,,1.369961,1.045403
d,,,0.73337,0.491143,
e,0.273916,,0.567614,,0.838729
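
The same masking can be spelled out with ```DataFrame.where()```, which also accepts a replacement value to use instead of NaN. A minimal sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"c0": [1.0, -2.0], "c1": [-3.0, 4.0]}, index=["a", "b"])

# Boolean-dataframe indexing and where() give the same result
masked = frame[frame > 0]
same = frame.where(frame > 0)
print(masked.equals(same))

# where() can fill the masked positions with a value other than NaN
zeroed = frame.where(frame > 0, 0)
print(zeroed)
```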


In [32]:
df[df["c3"]>0]

Unnamed: 0,c0,c1,c2,c3,c4
b,0.436149,-2.468962,-1.080083,0.693728,0.663031
c,1.676887,0.790989,-0.56,1.369961,1.045403
d,-2.687477,-0.417832,0.73337,0.491143,-0.026223


In [33]:
df[df["c2"]<0][["c1","c3"]]

Unnamed: 0,c1,c3
b,-2.468962,0.693728
c,0.790989,1.369961


In [34]:
# if we want to specify more than one condition, use & and |, instead of and/or (reason discussed later)
df[(df["c1"]>0) & (df["c2"] < 0)]

Unnamed: 0,c0,c1,c2,c3,c4
c,1.676887,0.790989,-0.56,1.369961,1.045403
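
A small sketch of combining conditions on made-up data. Each condition needs its own parentheses, ```~``` negates a condition, and using ```and``` in place of ```&``` raises a ```ValueError```.

```python
import pandas as pd

frame = pd.DataFrame(
    {"c1": [1, -1, 2, -3], "c2": [-1, -2, 3, 4]},
    index=["a", "b", "c", "d"],
)

both = frame[(frame["c1"] > 0) & (frame["c2"] < 0)]         # row a
either = frame[(frame["c1"] > 0) | (frame["c2"] < 0)]       # rows a, b, c
neither = frame[~((frame["c1"] > 0) | (frame["c2"] < 0))]   # row d

# Python's "and" asks each series for a single True/False, which pandas refuses
raised = False
try:
    frame[(frame["c1"] > 0) and (frame["c2"] < 0)]
except ValueError as err:
    raised = True
    print("ValueError:", err)
```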
