# Intro to Pandas

`pip install pandas`

- Pandas: the name is derived from "panel data", a term for multidimensional structured datasets.
- It is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
- Can also be used for simple data visualization
- Widely used in DS and ML applications for handling **structured** data (tabular).
- It has 2 main data components:
    - Series: a 1D labeled array (values plus an index)
    - DataFrame: a 2D labeled table (rows with an index, columns with names)

![Purpose of Pandas](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Purpose_of_Pandas.png)

![Features of Pandas](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Features_of_Pandas.png)

![Pandas Data Structures](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Data_Structures.png)

In [1]:
import pandas as pd

In [2]:
pd.__version__

'2.0.2'

## Pandas Series

In [5]:
my_list = [2,3,4,5,7,8,8,9,2]

type(my_list)

list

In [4]:
my_ser = pd.Series(my_list)
type(my_ser)

pandas.core.series.Series

In [6]:
my_ser

0    2
1    3
2    4
3    5
4    7
5    8
6    8
7    9
8    2
dtype: int64

In [7]:
my_ser[3]

5

In [8]:
my_ser[3:7]

3    5
4    7
5    8
6    8
dtype: int64

In [12]:
my_ser2 = my_ser[3:7].reset_index(drop=True) #resetting the index and dropping the old one
my_ser2

0    5
1    7
2    8
3    8
dtype: int64

In [15]:
my_ser3 = pd.Series([3,4,5], index=['x','y','z'], name='my_column')
my_ser3

x    3
y    4
z    5
Name: my_column, dtype: int64

In [17]:
my_ser3['x']

3
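
With a custom index, a Series can be read either by label or by integer position. The following is a minimal sketch that rebuilds the same values as `my_ser3` under the illustrative name `ser`; `.loc` and `.iloc` are the explicit label-based and position-based accessors.

```python
import pandas as pd

# Same values and labels as my_ser3 above (the name `ser` is illustrative)
ser = pd.Series([3, 4, 5], index=['x', 'y', 'z'], name='my_column')

by_label = ser.loc['y']          # label-based lookup
by_position = ser.iloc[1]        # position-based lookup (second element)
label_slice = ser.loc['x':'y']   # label slices include the end label

print(by_label, by_position)
print(label_slice)
```

Both lookups return the same element here; the difference only matters once labels and positions stop lining up.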

In [18]:
#converting a numpy array to a series
import numpy as np

my_arr = np.arange(2,25)
my_arr


array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24])

In [19]:
my_ser = pd.Series(my_arr)
my_ser

0      2
1      3
2      4
3      5
4      6
5      7
6      8
7      9
8     10
9     11
10    12
11    13
12    14
13    15
14    16
15    17
16    18
17    19
18    20
19    21
20    22
21    23
22    24
dtype: int64

In [22]:
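#convert series back to numpy array
#to_numpy() is the accessor recommended by the pandas docs; .values returns the same array here

my_arr_conv = my_ser.to_numpy()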
my_arr_conv

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24])

## Pandas Dataframes

**Anatomy of a DataFrame**
 
![df](https://static.packt-cdn.com/products/9781839213106/graphics/Images/B15597_01_01.png)

In [23]:
# build a dictionary
data = {
        'Name': ['Mark', 'Becky', 'Charlie'],
        'Age': [34,27,29],
        'Score': [98,76,84]
}

In [24]:
#convert the dictionary into a dataframe
#common name for a dataframe is df
df = pd.DataFrame(data)
df

|   | Name | Age | Score |
|---|---|---|---|
| 0 | Mark | 34 | 98 |
| 1 | Becky | 27 | 76 |
| 2 | Charlie | 29 | 84 |


In [25]:
print(df)

      Name  Age  Score
0     Mark   34     98
1    Becky   27     76
2  Charlie   29     84


In [26]:
type(df)

pandas.core.frame.DataFrame

In [27]:
type(df['Name'])

pandas.core.series.Series
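
Selecting a column with single brackets returns a Series, while passing a list of column names returns a DataFrame. A short sketch of that difference, using a hypothetical two-column frame named `df_demo`:

```python
import pandas as pd

# Hypothetical small frame, mirroring the columns used above
df_demo = pd.DataFrame({
    'Name': ['Mark', 'Becky', 'Charlie'],
    'Age': [34, 27, 29],
})

single = df_demo['Name']     # single brackets -> Series
framed = df_demo[['Name']]   # list of column names -> DataFrame

print(type(single).__name__)
print(type(framed).__name__)
print(framed.shape)
```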

### Slicing and Dicing A Dataframe Using `loc[]` and `iloc[]`

#### Using `loc[]`

In [28]:
# get the first row of the data
df.loc[0]

Name     Mark
Age        34
Score      98
Name: 0, dtype: object

In [30]:
# get the first and third rows
df.loc[[0,2]]

|   | Name | Age | Score |
|---|---|---|---|
| 0 | Mark | 34 | 98 |
| 2 | Charlie | 29 | 84 |


In [31]:
df.loc[[0,2]].ndim

2

In [32]:
df.shape #3 rows and 3 columns

(3, 3)

In [33]:
#check the data type for each column
df.dtypes

Name     object
Age       int64
Score     int64
dtype: object

In [34]:
# check the num of rows or height of the df
len(df)

3

In [36]:
# grab all rows of the Age column
df.loc[:,['Age']] # the colon means all rows; after the comma comes the list of columns

|   | Age |
|---|---|
| 0 | 34 |
| 1 | 27 |
| 2 | 29 |


In [37]:
df.loc[:,['Age', 'Score']]

|   | Age | Score |
|---|---|---|
| 0 | 34 | 98 |
| 1 | 27 | 76 |
| 2 | 29 | 84 |


Similar to SQL:
```SQL
SELECT Age, Score
FROM DF
```

You can also use `loc[]` as a filter by passing a boolean condition:

In [38]:
df.loc[df['Age']>28]

|   | Name | Age | Score |
|---|---|---|---|
| 0 | Mark | 34 | 98 |
| 2 | Charlie | 29 | 84 |


Similar to SQL:
```SQL
SELECT *
FROM DF
WHERE Age > 28
```
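
Row filters and column selection can be combined in a single `loc[]` call, roughly like `SELECT Name, Score FROM DF WHERE Age > 28 AND Score > 90`. A minimal sketch on a hypothetical copy of the data named `df_demo`; the second condition is an assumption added for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Mark', 'Becky', 'Charlie'],
    'Age': [34, 27, 29],
    'Score': [98, 76, 84],
})

# Each condition needs its own parentheses because & binds tighter than >
mask = (df_demo['Age'] > 28) & (df_demo['Score'] > 90)

# Rows come from the boolean mask, columns from the list of names
result = df_demo.loc[mask, ['Name', 'Score']]
print(result)
```

Use `|` for OR and `~` for NOT; the Python keywords `and`, `or`, and `not` do not work element-wise on a Series.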

#### Using `iloc[]`

`iloc[row range , column range]`

In [39]:
# select the first 2 rows with first 2 columns
df.iloc[:2,:2]

|   | Name | Age |
|---|---|---|
| 0 | Mark | 34 |
| 1 | Becky | 27 |


`loc` vs `iloc`
- `iloc` selects by integer position, so it is useful when you don't necessarily know the row labels or column names in advance
- `loc` selects by label, so it is more advantageous when you can't rely on the rows or columns being in a particular order
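
The difference becomes visible as soon as the index labels are not 0, 1, 2, and so on. A small sketch, assuming a hypothetical frame `df_demo` whose index labels are 10, 20, 30:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Name': ['Mark', 'Becky', 'Charlie'], 'Age': [34, 27, 29]},
    index=[10, 20, 30],   # hypothetical non-default labels
)

row_by_label = df_demo.loc[20]      # the row whose index label is 20
row_by_position = df_demo.iloc[0]   # the first row, whatever its label is

print(row_by_label['Name'])      # Becky
print(row_by_position['Name'])   # Mark

# Slicing differs too: loc includes the end label, iloc excludes the end position
print(len(df_demo.loc[10:20]))   # 2
print(len(df_demo.iloc[0:1]))    # 1
```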


### Using NumPy Functions on Pandas

In [7]:
# build a dictionary
data = {
        'Name': ['Mark', 'Becky', 'Charlie', 'Jessica', 'James'],
        'Age': [34,27,29, 30, 26],
        'Score': [98,66,84,68, 63],
        'State': ['NY', 'CA', 'TN', 'CO', 'TX']
}

df = pd.DataFrame(data)
df

|   | Name | Age | Score | State |
|---|---|---|---|---|
| 0 | Mark | 34 | 98 | NY |
| 1 | Becky | 27 | 66 | CA |
| 2 | Charlie | 29 | 84 | TN |
| 3 | Jessica | 30 | 68 | CO |
| 4 | James | 26 | 63 | TX |


**Exercise** Create a new column `Status` that is `Fail` if the score is less than 70 and `Pass` otherwise

In [8]:
import numpy as np
#defining a new column and adding it to the dataframe
df['Status'] = np.where(df['Score']<70, 'Fail', 'Pass')
df

|   | Name | Age | Score | State | Status |
|---|---|---|---|---|---|
| 0 | Mark | 34 | 98 | NY | Pass |
| 1 | Becky | 27 | 66 | CA | Fail |
| 2 | Charlie | 29 | 84 | TN | Pass |
| 3 | Jessica | 30 | 68 | CO | Fail |
| 4 | James | 26 | 63 | TX | Fail |


## Dataframe Iteration Methods

#### Method 1 - Using `index`

In [9]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [10]:
print(f"{df.loc[0,'Name']} is {df.loc[0,'Age']} years old")

Mark is 34 years old


We can use a loop to build the same thing for all rows.

In [11]:
for i in df.index:
    print(f"{df.loc[i,'Name']} is {df.loc[i,'Age']} years old")

Mark is 34 years old
Becky is 27 years old
Charlie is 29 years old
Jessica is 30 years old
James is 26 years old


#### Method 2 - using `iterrows()`

In [12]:
for i, row in df.iterrows():
    print(f"Student {i+1}: {row['Name']} lives in {row['State']}")

Student 1: Mark lives in NY
Student 2: Becky lives in CA
Student 3: Charlie lives in TN
Student 4: Jessica lives in CO
Student 5: James lives in TX
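
A third option, not covered above, is `itertuples()`, which yields each row as a named tuple so that columns are read as attributes. A minimal sketch on a hypothetical two-row frame `df_demo`; it assumes the column names are valid Python identifiers:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Mark', 'Becky'],
    'State': ['NY', 'CA'],
})

lines = []
# Each row is a named tuple; row.Index holds the index label
for row in df_demo.itertuples(index=True):
    lines.append(f"Student {row.Index + 1}: {row.Name} lives in {row.State}")

for line in lines:
    print(line)
```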


Column-wide updates do not need a loop. For example, add 5 years to every value in the `Age` column:

In [13]:
df['Age'] = df['Age'] + 5
df.head()

|   | Name | Age | Score | State | Status |
|---|---|---|---|---|---|
| 0 | Mark | 39 | 98 | NY | Pass |
| 1 | Becky | 32 | 66 | CA | Fail |
| 2 | Charlie | 34 | 84 | TN | Pass |
| 3 | Jessica | 35 | 68 | CO | Fail |
| 4 | James | 31 | 63 | TX | Fail |


> Note: when applying modifications on dataframes, it's recommended to always make a copy of the original dataframe.

In [14]:
df_org = df.copy() #make sure you run this before the modification
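
The reason `copy()` matters is that plain assignment only creates a second name for the same DataFrame. A small sketch of the difference, using hypothetical names `alias` and `backup`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Age': [34, 27, 29]})

alias = df_demo           # same object under a second name, not a copy
backup = df_demo.copy()   # independent copy of the data

df_demo['Age'] = df_demo['Age'] + 5

print(alias['Age'].tolist())    # [39, 32, 34] -> the alias sees the change
print(backup['Age'].tolist())   # [34, 27, 29] -> the copy keeps the original values
```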

### Using `apply()` and `map()`

In [15]:
#convert the score to a decimal fraction (98 -> 0.98)
df['Score'] = df.apply(lambda row: row['Score']/100, axis=1)
df

|   | Name | Age | Score | State | Status |
|---|---|---|---|---|---|
| 0 | Mark | 39 | 0.98 | NY | Pass |
| 1 | Becky | 32 | 0.66 | CA | Fail |
| 2 | Charlie | 34 | 0.84 | TN | Pass |
| 3 | Jessica | 35 | 0.68 | CO | Fail |
| 4 | James | 31 | 0.63 | TX | Fail |
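
For simple arithmetic like this, a row-wise `apply()` is not strictly needed: operating on the column directly gives the same values and avoids calling a Python function per row. A minimal sketch comparing the two on a hypothetical frame `scores`:

```python
import pandas as pd

scores = pd.DataFrame({'Score': [98, 66, 84, 68, 63]})

# Row-wise apply, as in the cell above
via_apply = scores.apply(lambda row: row['Score'] / 100, axis=1)

# Vectorized arithmetic on the whole column at once
vectorized = scores['Score'] / 100

print(vectorized.tolist())
print(via_apply.tolist() == vectorized.tolist())
```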


`apply()` can be really useful when you have a complex function 

In [16]:
def age_categorization(row):
    if row['Age'] >=35:
        return 'Over 35'
    elif row['Age'] > 31:
        return 'Between 31 and 35'
    else:
        return '31 and below'
    

df['AgeCategory'] = df.apply(age_categorization, axis=1)
df
    

|   | Name | Age | Score | State | Status | AgeCategory |
|---|---|---|---|---|---|---|
| 0 | Mark | 39 | 0.98 | NY | Pass | Over 35 |
| 1 | Becky | 32 | 0.66 | CA | Fail | Between 31 and 35 |
| 2 | Charlie | 34 | 0.84 | TN | Pass | Between 31 and 35 |
| 3 | Jessica | 35 | 0.68 | CO | Fail | Over 35 |
| 4 | James | 31 | 0.63 | TX | Fail | 31 and below |
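
As an alternative to a hand-written function, range-based categories like these can also be produced with `pd.cut`. This is a sketch of a different technique, not the method used above; the bin edges are an assumption chosen so that whole-number ages land in the same categories as the `if`/`elif` rules.

```python
import numpy as np
import pandas as pd

ages = pd.Series([39, 32, 34, 35, 31], name='Age')

# Bins are right-inclusive by default: (-inf, 31], (31, 34], (34, inf)
# For whole-number ages this reproduces the rules in age_categorization
categories = pd.cut(
    ages,
    bins=[-np.inf, 31, 34, np.inf],
    labels=['31 and below', 'Between 31 and 35', 'Over 35'],
)

print(categories.tolist())
```

Note that the result has a categorical dtype rather than plain strings, and that non-integer ages such as 34.5 would be binned differently from the function above.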


In [17]:
data_measure = {
    #measurements in meters
        'measurement1':[4,5,7,11,10],
        'measurement2':[20,17,18,19,23]
}

df_m = pd.DataFrame(data_measure)
df_m

|   | measurement1 | measurement2 |
|---|---|---|
| 0 | 4 | 20 |
| 1 | 5 | 17 |
| 2 | 7 | 18 |
| 3 | 11 | 19 |
| 4 | 10 | 23 |


In [18]:
#convert the measurements to cm
df_m_cm = df_m.apply(lambda x: x*100, axis=0)
df_m_cm

|   | measurement1 | measurement2 |
|---|---|---|
| 0 | 400 | 2000 |
| 1 | 500 | 1700 |
| 2 | 700 | 1800 |
| 3 | 1100 | 1900 |
| 4 | 1000 | 2300 |
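
The `axis` argument decides what the function receives: with `axis=0` it is called once per column, with `axis=1` once per row. A short sketch on the same measurement data; the range and total calculations are illustrative assumptions:

```python
import pandas as pd

df_m = pd.DataFrame({
    'measurement1': [4, 5, 7, 11, 10],
    'measurement2': [20, 17, 18, 19, 23],
})

# axis=0: the lambda receives each column as a Series -> one result per column
col_ranges = df_m.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the lambda receives each row as a Series -> one result per row
row_totals = df_m.apply(lambda row: row.sum(), axis=1)

print(col_ranges)
print(row_totals)
```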


#### Using `map()` Function

**Exercise** Add a column for region using `map()` function

In [19]:
df['State']

0    NY
1    CA
2    TN
3    CO
4    TX
Name: State, dtype: object

In [20]:
df['Region'] = df['State'].map({
                                    'NY':'North',
                                    'CA':'West',
                                    'TN':'South',
                                    'CO':'MidWest',
                                    'TX':'South'
                                    })

df

|   | Name | Age | Score | State | Status | AgeCategory | Region |
|---|---|---|---|---|---|---|---|
| 0 | Mark | 39 | 0.98 | NY | Pass | Over 35 | North |
| 1 | Becky | 32 | 0.66 | CA | Fail | Between 31 and 35 | West |
| 2 | Charlie | 34 | 0.84 | TN | Pass | Between 31 and 35 | South |
| 3 | Jessica | 35 | 0.68 | CO | Fail | Over 35 | MidWest |
| 4 | James | 31 | 0.63 | TX | Fail | 31 and below | South |


In SQL:
```SQL
CASE WHEN STATE = 'NY' THEN 'NORTH'
    WHEN STATE =....
```
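
One thing to watch with a dictionary passed to `map()`: values that have no matching key become `NaN`. A minimal sketch, assuming a hypothetical state `'FL'` that is missing from the mapping and a default label `'Unknown'`:

```python
import pandas as pd

states = pd.Series(['NY', 'CA', 'FL'])       # 'FL' is deliberately absent from the mapping
region_map = {'NY': 'North', 'CA': 'West'}

regions = states.map(region_map)             # unmapped values become NaN
regions_filled = regions.fillna('Unknown')   # comparable to an ELSE branch in a SQL CASE

print(regions.tolist())
print(regions_filled.tolist())
```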


Check the final DataFrame before saving it:

In [21]:
df

|   | Name | Age | Score | State | Status | AgeCategory | Region |
|---|---|---|---|---|---|---|---|
| 0 | Mark | 39 | 0.98 | NY | Pass | Over 35 | North |
| 1 | Becky | 32 | 0.66 | CA | Fail | Between 31 and 35 | West |
| 2 | Charlie | 34 | 0.84 | TN | Pass | Between 31 and 35 | South |
| 3 | Jessica | 35 | 0.68 | CO | Fail | Over 35 | MidWest |
| 4 | James | 31 | 0.63 | TX | Fail | 31 and below | South |


Saving the edited df into a csv file

In [22]:
df.to_csv('my_df.csv')
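
By default `to_csv()` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A small sketch of the difference `index=False` makes; it writes to a temporary directory and uses a hypothetical frame `df_demo`:

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({'Name': ['Mark', 'Becky'], 'Age': [39, 32]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_df.csv')

    df_demo.to_csv(path)                # index is written as an unnamed first column
    with_index = pd.read_csv(path)

    df_demo.to_csv(path, index=False)   # index is left out of the file
    without_index = pd.read_csv(path)

print(with_index.columns.tolist())      # ['Unnamed: 0', 'Name', 'Age']
print(without_index.columns.tolist())   # ['Name', 'Age']
```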

### Sorting Data

In [25]:
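df.sort_values(by='Age', inplace=True) #inplace=True modifies df itself instead of returning a new DataFrame
df.reset_index(drop=True) #only displays a re-indexed result; df keeps its old index because inplace is not set here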

|   | Name | Age | Score | State | Status | AgeCategory | Region |
|---|---|---|---|---|---|---|---|
| 0 | James | 31 | 0.63 | TX | Fail | 31 and below | South |
| 1 | Becky | 32 | 0.66 | CA | Fail | Between 31 and 35 | West |
| 2 | Charlie | 34 | 0.84 | TN | Pass | Between 31 and 35 | South |
| 3 | Jessica | 35 | 0.68 | CO | Fail | Over 35 | MidWest |
| 4 | Mark | 39 | 0.98 | NY | Pass | Over 35 | North |


In [28]:
#descending order
df.sort_values(by='Age', inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df

|   | Name | Age | Score | State | Status | AgeCategory | Region |
|---|---|---|---|---|---|---|---|
| 0 | Mark | 39 | 0.98 | NY | Pass | Over 35 | North |
| 1 | Jessica | 35 | 0.68 | CO | Fail | Over 35 | MidWest |
| 2 | Charlie | 34 | 0.84 | TN | Pass | Between 31 and 35 | South |
| 3 | Becky | 32 | 0.66 | CA | Fail | Between 31 and 35 | West |
| 4 | James | 31 | 0.63 | TX | Fail | 31 and below | South |


In [27]:
#without inplace=True, sort_values returns a new sorted DataFrame, stored here as df2; df itself is unchanged
df2 = df.sort_values(by='Age', ascending=False).reset_index(drop=True)

In [None]:
#sort by Age descending, then by Score ascending to break ties between equal ages
df2 = df.sort_values(by=['Age', 'Score'], ascending=[False, True]).reset_index(drop=True)
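
The second sort key only has an effect when the first one has ties, which the data above does not. A minimal sketch with hypothetical rows that share ages, to show the tie-breaking:

```python
import pandas as pd

# Hypothetical rows with tied ages
df_demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [30, 35, 30, 35],
    'Score': [0.9, 0.7, 0.6, 0.8],
})

# Age descending first; within the same Age, Score ascending
df_sorted = (
    df_demo.sort_values(by=['Age', 'Score'], ascending=[False, True])
           .reset_index(drop=True)
)

print(df_sorted)
```

The two 35-year-olds come first, ordered by the lower score, followed by the two 30-year-olds in the same way.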