Source: https://github.com/jeffheaton/app_deep_learning/blob/854a0e4a3982daea4e50d34e4d8fc0d9d806b960//t81_558_class_02_1_python_pandas.ipynb

Pandas is based on the dataframe concept found in the R programming language. 

In [9]:
import pandas as pd
import numpy as np

In [2]:
# display function provides a cleaner display than merely printing the data frame 
pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 6)
df = pd.read_csv('data/auto-mpg.csv')
display(df)

,mpg,cylinders,displacement,...,year,origin,name
0,18.0,8,307.0,...,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,...,70,1,buick skylark 320
2,18.0,8,318.0,...,70,1,plymouth satellite
...,...,...,...,...,...,...,...
395,32.0,4,135.0,...,82,1,dodge rampage
396,28.0,4,120.0,...,82,1,ford ranger
397,31.0,4,119.0,...,82,1,chevy s-10


The following code generates a list of dictionaries holding summary statistics (mean, variance, and standard deviation) for each numeric field of the dataframe.

In [3]:
# strip non-numerics
df = df.select_dtypes(include=['int', 'float'])

headers = list(df.columns.values)
fields = []

for field in headers:
    fields.append(
        {
            "name": field,
            "mean": df[field].mean(),
            "var": df[field].var(),
            "sdev": df[field].std(),
        }
    )
    
for field in fields:
    print(field)

{'name': 'mpg', 'mean': 23.514572864321607, 'var': 61.089610774274405, 'sdev': 7.815984312565782}
{'name': 'cylinders', 'mean': 5.454773869346734, 'var': 2.893415439920003, 'sdev': 1.7010042445332119}
{'name': 'displacement', 'mean': 193.42587939698493, 'var': 10872.199152247384, 'sdev': 104.26983817119591}
{'name': 'weight', 'mean': 2970.424623115578, 'var': 717140.9905256763, 'sdev': 846.8417741973268}
{'name': 'acceleration', 'mean': 15.568090452261307, 'var': 7.604848233611383, 'sdev': 2.757688929812676}
{'name': 'year', 'mean': 76.01005025125629, 'var': 13.672442818627143, 'sdev': 3.697626646732623}
{'name': 'origin', 'mean': 1.5728643216080402, 'var': 0.6432920268850549, 'sdev': 0.8020548777266148}
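
As a possible alternative to the explicit loop, pandas can compute the same three statistics in a single call with `DataFrame.agg`. The sketch below is one way to do it, not the notebook's method. It uses a small made-up frame (`demo` and its values are illustrative, not taken from the Auto MPG file) so that it runs without the CSV.

```python
import pandas as pd

# Small illustrative frame; these values are made up for the example.
demo = pd.DataFrame({"mpg": [18.0, 15.0, 18.0, 16.0], "cylinders": [8, 8, 8, 6]})

# agg returns one row per statistic; transpose so each field becomes a row.
stats = demo.agg(["mean", "var", "std"]).T
stats = stats.rename(columns={"std": "sdev"})
stats.index.name = "name"

# Convert back to the same list-of-dictionaries shape as the loop above.
fields_alt = stats.reset_index().to_dict("records")
print(fields_alt)
```

Like `Series.var` and `Series.std`, the aggregated values use the sample (n-1) definition, so they should match the loop's results on the same data.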


The following code converts the list of dictionaries into a pandas dataframe. To restore the default pandas display, set the display values to zero.

In [4]:
pd.set_option('display.max_columns', 0)
pd.set_option('display.max_rows', 0)
df2 = pd.DataFrame(data=fields)
display(df2)

,name,mean,var,sdev
0,mpg,23.514573,61.089611,7.815984
1,cylinders,5.454774,2.893415,1.701004
2,displacement,193.425879,10872.199152,104.269838
3,weight,2970.424623,717140.990526,846.841774
4,acceleration,15.56809,7.604848,2.757689
5,year,76.01005,13.672443,3.697627
6,origin,1.572864,0.643292,0.802055
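
Another way to undo a display change is `pd.reset_option`, which returns a named option to its built-in default. The following is a minimal sketch that assumes the option has not been changed earlier in the same session.

```python
import pandas as pd

# Remember the value pandas started with for this option.
default_rows = pd.get_option("display.max_rows")

# Temporarily limit the display, as done earlier in this section.
pd.set_option("display.max_rows", 6)

# reset_option puts the option back to its built-in default.
pd.reset_option("display.max_rows")
restored_rows = pd.get_option("display.max_rows")
print(restored_rows == default_rows)
```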


### Managing missing values

Missing values are a reality of machine learning, and many real-world datasets contain them. Most of the values are present in the MPG database. However, there are missing values in the horsepower column. A common practice is to replace missing values with the median value for that column. The following code replaces any NA values in horsepower with the median.

In [5]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

In [6]:
print(f"horsepower has na? {pd.isnull(df['horsepower']).values.any()}")

horsepower has na? True


In [7]:
print("Filling missing values...")
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
# It is also common to drop the entire row that has a missing value instead of filling with the median
# df = df.dropna()
print(f"horsepower has na? {pd.isnull(df['horsepower']).values.any()}")

Filling missing values...
horsepower has na? False
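
Before choosing between filling and dropping, it can help to count how many values are missing in each column. The sketch below shows both options side by side on a made-up frame (`demo` and its values are illustrative assumptions, not the Auto MPG data).

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing horsepower value.
demo = pd.DataFrame(
    {"horsepower": [130.0, np.nan, 150.0, 95.0], "mpg": [18.0, 25.0, 16.0, 24.0]}
)

# Count the missing values in each column before deciding how to handle them.
missing_counts = demo.isna().sum()
print(missing_counts)

# Option 1: fill with the column median (median skips NaN by default).
filled = demo.assign(
    horsepower=demo["horsepower"].fillna(demo["horsepower"].median())
)

# Option 2: drop every row that contains a missing value.
dropped = demo.dropna()
print(len(filled), len(dropped))
```

Filling keeps every row at the cost of inventing a value, while dropping keeps only observed values at the cost of losing rows.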


### Dealing with outliers

Outliers are values that are unusually high or low. We typically consider an outlier to be a value that is several standard deviations from the mean. Sometimes outliers are simply errors that result from observational error. Outliers can also be truly large or small values that may be difficult to address. The following function can remove such values.

In [6]:
# Remove all rows where the specified column is sd or more standard deviations from the mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[
        (np.abs(df[name] - df[name].mean()) >= (sd*df[name].std()))
    ]
    df.drop(drop_rows, axis=0, inplace=True)
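
Because `remove_outliers` modifies the dataframe in place, a variant that returns a new frame may be easier to reason about. The sketch below applies the same rule; the name `without_outliers` and the `demo` values are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def without_outliers(df, name, sd):
    """Hypothetical helper: return a copy of df without outlier rows in column `name`."""
    distance = np.abs(df[name] - df[name].mean())
    is_outlier = distance >= (sd * df[name].std())
    return df[~is_outlier].copy()


# Made-up values: nine ordinary readings and one extreme reading.
demo = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15, 16, 17, 18, 200]})
cleaned = without_outliers(demo, "value", 2)

print(len(demo), len(cleaned))
```

Here the single extreme reading lies more than two standard deviations from the mean, so it is the only row removed, and `demo` itself is left unchanged.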

The code below drops every row from the Auto MPG dataset where mpg is two or more standard deviations above or below the mean.

In [8]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

# fill missing horsepower values with the median
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)

# drop the name column
df.drop(columns="name", inplace=True)

# drop the outliers in mpg
print(f"Length before MPG outliers dropped: {len(df)}")
remove_outliers(df, "mpg", 2)
print(f"Length after MPG outliers dropped: {len(df)}")

Length before MPG outliers dropped: 398
Length after MPG outliers dropped: 388


### Dropping Fields
Drop fields that are of no value for training the neural network. The following code drops the name column from the Auto MPG dataset.

In [10]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

print(f"Before drop: {list(df.columns)}")
df.drop("name", axis=1, inplace=True)
print(f"After drop: {list(df.columns)}")

Before drop: ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
After drop: ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']
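
`drop` can also return a new dataframe instead of modifying the original. The sketch below illustrates this on a made-up frame (`demo` is an assumption for the example) and shows the `errors="ignore"` argument, which skips labels that are not present.

```python
import pandas as pd

# Made-up frame standing in for the Auto MPG data.
demo = pd.DataFrame({"mpg": [18.0, 15.0], "name": ["car a", "car b"]})

# Without inplace=True, drop returns a new frame and leaves demo unchanged.
trimmed = demo.drop(columns=["name"])

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed_again = trimmed.drop(columns=["name"], errors="ignore")

print(list(demo.columns), list(trimmed.columns))
```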


### Concatenating Rows and Columns
Pandas can concatenate rows and columns together to form new data frames. This code creates a new data frame from the name and horsepower columns in the Auto MPG dataset.

In [11]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name, col_horsepower], axis=1)

result.head()

,name,horsepower
0,chevrolet chevelle malibu,130.0
1,buick skylark 320,165.0
2,plymouth satellite,150.0
3,amc rebel sst,150.0
4,ford torino,140.0


The concat function can also concatenate rows together. This code concatenates the first two rows and the last two rows of the Auto MPG dataset.

In [12]:
# create a new dataframe from first 2 rows and last 2 rows
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

result = pd.concat([df[0:2], df[-2:]], axis=0)
result.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
396,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger
397,31.0,4,119.0,82.0,2720,19.4,82,1,chevy s-10
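
Note that the concatenated rows keep their original index labels (0, 1, 396, 397). If consecutive labels are preferred, `pd.concat` accepts `ignore_index=True`. The sketch below shows the difference on a made-up five-row frame (`demo` is illustrative only).

```python
import pandas as pd

# Made-up frame with five rows, indexed 0 through 4.
demo = pd.DataFrame({"mpg": [18.0, 15.0, 18.0, 28.0, 31.0]})

# Keep the original row labels, as in the cell above.
kept = pd.concat([demo[0:2], demo[-2:]], axis=0)

# Renumber the rows from zero instead.
renumbered = pd.concat([demo[0:2], demo[-2:]], axis=0, ignore_index=True)

print(list(kept.index), list(renumbered.index))
```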


### Training and Validation
We must evaluate a machine learning model based on its ability to predict values that it has never seen before. Because of this, we often divide the available data into a training set and a validation set. The machine learning model will learn from the training data but ultimately be evaluated based on the validation data.
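
A minimal sketch of such a split is shown below. The 80/20 ratio, the `random_state` value, and the made-up `demo` frame are all assumptions chosen for illustration; other ratios and splitting methods are equally valid.

```python
import pandas as pd

# Made-up frame with ten rows standing in for a real dataset.
demo = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Randomly pick 80% of the rows for training; random_state makes it repeatable.
train = demo.sample(frac=0.8, random_state=42)

# Every row not chosen for training goes to the validation set.
validation = demo.drop(train.index)

print(len(train), len(validation))
```

Because the validation rows are obtained by dropping the training rows, the two sets never overlap and together cover the whole frame.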

