# Pandas Tutorials
Practice using pandas techniques to process data.

## <a id="toc">Table of Contents</a>
> 1. [Access Data](#1)
    1. Excel
    
> 2. [DataFrame Validation](#2)
    1. **head()/tail()**: first/last n rows
    2. **shape**: number of rows and columns
    3. **info()**: detailed information about the contents of the DataFrame
    4. **dtypes**: data type of each column in the DataFrame
    5. **columns**: returns the column labels as an Index (use `to_list()` to get a list)
    
> 3. [DataFrame Exploration](#3)
    1. **describe()**: summary statistics of numeric values
    2. **Column Slicing**: specify columns of a DataFrame
    3. **Update values**: update one or more values in a DataFrame
    4. **Missing Values**:
        - **isnull()**: generates a boolean mask indicating missing values
        - **notnull()**: opposite of isnull()
        - **dropna()**: filtered copy of the original DataFrame with missing values removed
        - **fillna()**: copy of the original DataFrame with missing values filled/imputed


In [28]:
import pandas as pd
import numpy as np
import os


## Point to the data folder, which sits in the same directory as the notebook.
## os.path.join builds the path with the correct separator on any operating system.
dataPath = os.path.join(os.getcwd(), 'data')

## Specify the data files
carsFile = "cars.xlsx"

### <a id=1> 1. Access Data </a>
[Back to contents](#toc)

In [10]:
cars = pd.read_excel(os.path.join(dataPath, carsFile))
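
As a hedged alternative to string concatenation, the same path can be built with `pathlib`. This is only a sketch: it assumes the notebook's layout (a `data` folder beside the notebook holding `cars.xlsx`), and the names `data_dir`, `cars_path`, and `cars_demo` are illustrative rather than part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumed layout: a "data" folder in the current working directory.
data_dir = Path.cwd() / "data"
cars_path = data_dir / "cars.xlsx"

# Only attempt the read when the file is actually present.
if cars_path.exists():
    cars_demo = pd.read_excel(cars_path)  # read_excel accepts a Path object
    print(cars_demo.shape)
else:
    print(f"File not found: {cars_path}")
```

Checking for the file first gives a clearer message than the traceback raised when the path is wrong.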

### <a id=2> 2. DataFrame Validation </a>
[Back to contents](#toc)

#### a. head()/tail()

In [11]:
cars.head()

,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
0,Acura,MDX,SUV,Asia,All,36945,33337,3.5,6.0,265,17,23,4451,106,189
1,Acura,RSX Type S 2dr,Sedan,Asia,Front,23820,21761,2.0,4.0,200,24,31,2778,101,172
2,Acura,TSX 4dr,Sedan,Asia,Front,26990,24647,2.4,4.0,200,22,29,3230,105,183
3,Acura,TL 4dr,Sedan,Asia,Front,33195,30299,3.2,6.0,270,20,28,3575,108,186
4,Acura,3.5 RL 4dr,Sedan,Asia,Front,43755,39014,3.5,6.0,225,18,24,3880,115,197


In [27]:
cars.tail(1)

,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
427,Volvo,XC70,Wagon,Europe,All,35145,33112,2.5,5.0,208,20,27,3823,109,186
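
Both methods take an optional row count. The sketch below uses a made-up ten-row frame (`demo` is not part of the cars data) to show the default of five rows and what a negative count does.

```python
import pandas as pd

# Toy frame with ten rows, index 0..9 (illustrative data only).
demo = pd.DataFrame({"n": list(range(10))})

first_five = demo.head()       # default n=5
last_two = demo.tail(2)        # last two rows, index 8 and 9
all_but_last_two = demo.head(-2)  # negative n: everything except the last 2 rows

print(len(first_five), last_two.index.tolist(), len(all_but_last_two))
```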


#### b. shape

In [12]:
cars.shape

(428, 15)
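
`shape` is a plain tuple of `(rows, columns)`, so it can be unpacked directly. A minimal sketch on a made-up frame (`demo` is illustrative):

```python
import pandas as pd

# Toy frame: 3 rows, 2 columns (illustrative data only).
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = demo.shape   # unpack the tuple
print(n_rows, n_cols)

# len() gives the row count; size gives rows * columns.
print(len(demo), demo.size)
```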

#### c. info()

In [14]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Make         428 non-null    object 
 1   Model        428 non-null    object 
 2   Type         428 non-null    object 
 3   Origin       428 non-null    object 
 4   DriveTrain   428 non-null    object 
 5   MSRP         428 non-null    int64  
 6   Invoice      428 non-null    int64  
 7   EngineSize   428 non-null    float64
 8   Cylinders    426 non-null    float64
 9   Horsepower   428 non-null    int64  
 10  MPG_City     428 non-null    int64  
 11  MPG_Highway  428 non-null    int64  
 12  Weight       428 non-null    int64  
 13  Wheelbase    428 non-null    int64  
 14  Length       428 non-null    int64  
dtypes: float64(2), int64(8), object(5)
memory usage: 50.3+ KB
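
In the output above, `Cylinders` has 426 non-null values out of 428 entries, so two rows are missing a value. The same non-null counts can be obtained as a Series with `count()`. The sketch below uses a small made-up frame (`demo` is illustrative, not the cars data):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Cylinders value (illustrative data only).
demo = pd.DataFrame({
    "Cylinders": [6.0, np.nan, 4.0],
    "Horsepower": [265, 200, 200],
})

non_null = demo.count()          # non-null count per column, as info() reports
missing = len(demo) - non_null   # missing values per column

print(non_null)
print(missing)
```

A column that holds `NaN` is stored as `float64`, which is consistent with `Cylinders` being a float column in the `info()` output.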


#### d. dtypes

In [24]:
cars.dtypes

Make            object
Model           object
Type            object
Origin          object
DriveTrain      object
MSRP             int64
Invoice          int64
EngineSize     float64
Cylinders      float64
Horsepower       int64
MPG_City         int64
MPG_Highway      int64
Weight           int64
Wheelbase        int64
Length           int64
dtype: object
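
Once the dtypes are known, `select_dtypes()` can split a frame into numeric and non-numeric columns. A minimal sketch on a two-row frame whose values are copied from the cars output above (the name `demo` is illustrative):

```python
import pandas as pd

# Two rows taken from the cars output shown earlier.
demo = pd.DataFrame({
    "Make": ["Acura", "Volvo"],
    "MSRP": [36945, 35145],
    "EngineSize": [3.5, 2.5],
})

# Integer and float columns both count as "number".
numeric_cols = demo.select_dtypes(include="number").columns.to_list()
other_cols = demo.select_dtypes(exclude="number").columns.to_list()

print(numeric_cols)
print(other_cols)
```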

#### e. columns

In [30]:
cars.columns

Index(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'],
      dtype='object')

In [32]:
type(cars.columns)

pandas.core.indexes.base.Index

Return the column names as a list

In [35]:
print(type(cars.columns.to_list()))
print(cars.columns.to_list())

<class 'list'>
['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice', 'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway', 'Weight', 'Wheelbase', 'Length']


### <a id=3> 3. DataFrame Exploration </a>
[Back to contents](#toc)

#### a. describe()

In [29]:
cars.describe()

,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
count,428.0,428.0,428.0,426.0,428.0,428.0,428.0,428.0,428.0,428.0
mean,32774.85514,30014.700935,3.196729,5.807512,215.885514,20.060748,26.843458,3577.953271,108.154206,186.36215
std,19431.716674,17642.11775,1.108595,1.558443,71.836032,5.238218,5.741201,758.983215,8.311813,14.357991
min,10280.0,9875.0,1.3,3.0,73.0,10.0,12.0,1850.0,89.0,143.0
25%,20334.25,18866.0,2.375,4.0,165.0,17.0,24.0,3104.0,103.0,178.0
50%,27635.0,25294.5,3.0,6.0,210.0,19.0,26.0,3474.5,107.0,187.0
75%,39205.0,35710.25,3.9,6.0,255.0,21.25,29.0,3977.75,112.0,194.0
max,192465.0,173560.0,8.3,12.0,500.0,60.0,66.0,7190.0,144.0,238.0
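
The result of `describe()` is itself a DataFrame, so individual statistics can be looked up with `.loc`. By default only numeric columns are summarized; passing `include="all"` also reports `unique`, `top`, and `freq` for the other columns. A sketch using the first five `Make` and `MSRP` values from the `head()` output (the name `demo` is illustrative):

```python
import pandas as pd

# First five rows of Make and MSRP from the head() output above.
demo = pd.DataFrame({
    "Make": ["Acura"] * 5,
    "MSRP": [36945, 23820, 26990, 33195, 43755],
})

stats = demo.describe()                # numeric columns only
mean_msrp = stats.loc["mean", "MSRP"]
median_msrp = stats.loc["50%", "MSRP"]

full = demo.describe(include="all")    # adds unique/top/freq for Make

print(mean_msrp, median_msrp)
print(full.loc[["unique", "top", "freq"], "Make"])
```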


#### b. Column Slicing

In [37]:
cols=['Make','Model','MSRP']
cars[cols].head()

,Make,Model,MSRP
0,Acura,MDX,36945
1,Acura,RSX Type S 2dr,23820
2,Acura,TSX 4dr,26990
3,Acura,TL 4dr,33195
4,Acura,3.5 RL 4dr,43755


In [39]:
cols=['Make','Model','MSRP']
cars.loc[:5,cols]

,Make,Model,MSRP
0,Acura,MDX,36945
1,Acura,RSX Type S 2dr,23820
2,Acura,TSX 4dr,26990
3,Acura,TL 4dr,33195
4,Acura,3.5 RL 4dr,43755
5,Acura,3.5 RL w/Navigation 4dr,46100
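
Note that `cars.loc[:5, cols]` returned six rows (labels 0 through 5): label-based slices with `.loc` include the end label, whereas position-based slices with `.iloc` exclude the end position. A minimal sketch of the difference, using the six rows shown above (the name `demo` is illustrative):

```python
import pandas as pd

# The six Acura rows shown above, with a default 0..5 index.
demo = pd.DataFrame({
    "Model": ["MDX", "RSX Type S 2dr", "TSX 4dr", "TL 4dr",
              "3.5 RL 4dr", "3.5 RL w/Navigation 4dr"],
    "MSRP": [36945, 23820, 26990, 33195, 43755, 46100],
})

by_label = demo.loc[:2, ["Model", "MSRP"]]   # labels 0, 1, 2 (end included)
by_position = demo.iloc[:2, [0, 1]]          # positions 0, 1 (end excluded)

print(len(by_label), len(by_position))
```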


#### c. Update values

In [47]:
## Create a test DataFrame
df1 = pd.DataFrame([['cold', 9], ['warm',4],[None, 4]],
                   columns=['Strings', 'Integers'])
df1.head()

,Strings,Integers
0,cold,9
1,warm,4
2,,4


In [50]:
## Set every column to None in the rows where Integers equals 9.
## Without a column label, .loc assigns to the whole row, not just to Integers.
df1.loc[df1.Integers == 9] = None 
df1.head()

,Strings,Integers
0,,
1,warm,4.0
2,,4.0
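
The update above blanks the entire matching row, which is why `Strings` is also empty in row 0. To change a single column, pass the column label as the second argument to `.loc`. The sketch below is an assumption-laden variant of the test frame: the column is named `Values` and holds floats so that assigning `NaN` does not change its dtype.

```python
import numpy as np
import pandas as pd

# Variant of the test frame with a float column (illustrative names and data).
demo = pd.DataFrame({
    "Strings": ["cold", "warm", None],
    "Values": [9.0, 4.0, 4.0],
})

# Row mask plus column label: only the Values cell in matching rows changes.
demo.loc[demo["Values"] == 9.0, "Values"] = np.nan

print(demo)
```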


#### d. Missing Values

In [59]:
## Create a test DataFrame with missing values

colNames = ['Temp', 'Speed', 'Measure1', 'Measure2', 'Measure3', 'Measure4'] 

df2 = pd.DataFrame([['cold','slow',None, 2.7, 6.6, 3.1],
                    ['warm','medium', 4.2, 5.1, 7.9,9.1],
                    ['hot', 'fast', 9.4, 11.0, None, 6.8],
                    ['cool', None, None, None, 9.1, 8.9],
                    ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
                    [None, 'slow', None, 2.9, 3.3, 1.7],
                    [None, 'slow', None, 2.9, 3.3, 1.7]],
                   columns=colNames)

df2.head(10)

,Temp,Speed,Measure1,Measure2,Measure3,Measure4
0,cold,slow,,2.7,6.6,3.1
1,warm,medium,4.2,5.1,7.9,9.1
2,hot,fast,9.4,11.0,,6.8
3,cool,,,,9.1,8.9
4,cool,medium,6.1,4.3,12.2,3.7
5,,slow,,2.9,3.3,1.7
6,,slow,,2.9,3.3,1.7


Missing values are stored as `None` in object columns and as `NaN` in float columns.

In [60]:
df2.dtypes

Temp         object
Speed        object
Measure1    float64
Measure2    float64
Measure3    float64
Measure4    float64
dtype: object

#### d1. isnull()

Return a boolean mask that is `True` wherever a value is missing

In [63]:
df2.isnull()

,Temp,Speed,Measure1,Measure2,Measure3,Measure4
0,False,False,True,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,True,False
3,False,True,True,True,False,False
4,False,False,False,False,False,False
5,True,False,True,False,False,False
6,True,False,True,False,False,False


Count the missing values in each column

In [64]:
df2.isnull().sum()

Temp        2
Speed       1
Measure1    4
Measure2    1
Measure3    1
Measure4    0
dtype: int64

#### d2. notnull()

In [65]:
df2.notnull().sum()

Temp        5
Speed       6
Measure1    3
Measure2    6
Measure3    6
Measure4    7
dtype: int64
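
The table of contents also lists `dropna()` and `fillna()`, both of which return a copy and leave the original DataFrame unchanged. The following is a minimal sketch on a made-up three-row frame in the style of `df2`; the name `demo` and the choice of fill values (column mean, zero) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame in the style of df2 (illustrative data only).
demo = pd.DataFrame({
    "Speed": ["slow", "medium", None],
    "Measure1": [np.nan, 4.2, 9.4],
    "Measure2": [2.7, 5.1, np.nan],
})

# Drop every row that has at least one missing value.
complete_rows = demo.dropna()

# Drop rows only when a specific column is missing.
has_speed = demo.dropna(subset=["Speed"])

# Fill per column: mean for Measure1, zero for Measure2; Speed is left alone.
filled = demo.fillna({"Measure1": demo["Measure1"].mean(), "Measure2": 0.0})

print(complete_rows.index.tolist(), has_speed.index.tolist())
print(filled)
```

Because both methods return copies, the result has to be assigned to a name (or back to the original variable) for the change to persist.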
