Sources: 
- https://github.com/jeffheaton/app_deep_learning
- https://maxhalford.github.io/blog/target-encoding/


Pandas is based on the dataframe concept found in the R programming language. 

In [2]:
import pandas as pd
import numpy as np

In [2]:
# display function provides a cleaner display than merely printing the data frame 
pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 6)
df = pd.read_csv('data/auto-mpg.csv')
display(df)

Unnamed: 0,mpg,cylinders,displacement,...,year,origin,name
0,18.0,8,307.0,...,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,...,70,1,buick skylark 320
2,18.0,8,318.0,...,70,1,plymouth satellite
...,...,...,...,...,...,...,...
395,32.0,4,135.0,...,82,1,dodge rampage
396,28.0,4,120.0,...,82,1,ford ranger
397,31.0,4,119.0,...,82,1,chevy s-10


Generating a list of dictionaries with statistical information about the dataframe field-by-field (headers)

In [3]:
# strip non-numerics
df = df.select_dtypes(include=['int', 'float'])

headers = list(df.columns.values)
fields = []

for field in headers:
    fields.append(
        {
            "name": field,
            "mean": df[field].mean(),
            "var": df[field].var(),
            "sdev": df[field].std(),
        }
    )
    
for field in fields:
    print(field)

{'name': 'mpg', 'mean': 23.514572864321607, 'var': 61.089610774274405, 'sdev': 7.815984312565782}
{'name': 'cylinders', 'mean': 5.454773869346734, 'var': 2.893415439920003, 'sdev': 1.7010042445332119}
{'name': 'displacement', 'mean': 193.42587939698493, 'var': 10872.199152247384, 'sdev': 104.26983817119591}
{'name': 'weight', 'mean': 2970.424623115578, 'var': 717140.9905256763, 'sdev': 846.8417741973268}
{'name': 'acceleration', 'mean': 15.568090452261307, 'var': 7.604848233611383, 'sdev': 2.757688929812676}
{'name': 'year', 'mean': 76.01005025125629, 'var': 13.672442818627143, 'sdev': 3.697626646732623}
{'name': 'origin', 'mean': 1.5728643216080402, 'var': 0.6432920268850549, 'sdev': 0.8020548777266148}


The following code converts the list of dictionaries into a pandas dataframe. To restore the default pandas display, set the display values to zero.

In [4]:
pd.set_option('display.max_columns', 0)
pd.set_option('display.max_rows', 0)
df2 = pd.DataFrame(data=fields)
display(df2)

Unnamed: 0,name,mean,var,sdev
0,mpg,23.514573,61.089611,7.815984
1,cylinders,5.454774,2.893415,1.701004
2,displacement,193.425879,10872.199152,104.269838
3,weight,2970.424623,717140.990526,846.841774
4,acceleration,15.56809,7.604848,2.757689
5,year,76.01005,13.672443,3.697627
6,origin,1.572864,0.643292,0.802055
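
The same per-column statistics can also be computed without an explicit loop. The sketch below is a minimal illustration that assumes a small made-up frame (`df_demo` and its values are hypothetical, not the Auto MPG data) and uses `DataFrame.agg` to get the mean, variance, and standard deviation in one call.

```python
import pandas as pd

# Small illustrative frame; values are made up, not taken from auto-mpg.csv
df_demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 32.0, 28.0],
    "cylinders": [8, 8, 4, 4],
    "name": ["a", "b", "c", "d"],
})

# Keep numeric columns only, then compute all three statistics at once
numeric = df_demo.select_dtypes(include="number")
stats = numeric.agg(["mean", "var", "std"]).T

# Match the column names used in the loop-based version
stats = stats.rename(columns={"std": "sdev"})
stats.index.name = "name"
print(stats.reset_index())
```

Using `include="number"` selects every numeric dtype, which avoids depending on how the strings 'int' and 'float' are resolved on a given platform.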


### Managing missing values

Missing values are a reality of machine learning, and real-world datasets frequently contain them. Most of the values are present in the MPG dataset; however, the horsepower column has some missing values. A common practice is to replace missing values with the median value for that column. The following code replaces any NA values in horsepower with the median.

In [5]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

In [6]:
print(f"horsepower has na? {pd.isnull(df['horsepower']).values.any()}")

horsepower has na? True


In [7]:
print("Filling missing values...")
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
# another common approach is to drop every row that contains a missing value instead of filling it
# df = df.dropna()
print(f"horsepower has na? {pd.isnull(df['horsepower']).values.any()}")

Filling missing values...
horsepower has na? False
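
Before choosing between filling and dropping, it can help to count the missing values per column. The following is a minimal sketch on a hypothetical five-row frame (`df_demo` is made up for illustration); it shows both options side by side.

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing horsepower values (not the real dataset)
df_demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 21.0, 30.0],
    "horsepower": [130.0, np.nan, 95.0, np.nan, 70.0],
})

# Count missing values per column
missing_before = df_demo.isna().sum()
print(missing_before)

# Option 1: fill with the median of the observed values (70, 95, 130 -> 95)
med = df_demo["horsepower"].median()
filled = df_demo.fillna({"horsepower": med})

# Option 2: drop every row that contains a missing value
dropped = df_demo.dropna()

print(len(filled), len(dropped))
```

Filling keeps all five rows, while dropping keeps only the three complete ones; which is preferable depends on how much data you can afford to lose.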


### Dealing with outliers

Outliers are values that are unusually high or low. We typically consider a value an outlier when it lies several standard deviations from the mean. Sometimes outliers are simply errors that result from observational error. Outliers can also be genuinely large or small values that are difficult to address. The following function removes such values.

In [6]:
# Remove all rows where the specified column is +/- sd standard deviations
def remove_outliers(df, name, sd):
    drop_rows = df.index[
        (np.abs(df[name] - df[name].mean()) >= (sd*df[name].std()))
    ]
    df.drop(drop_rows, axis=0, inplace=True)

The code below drops every row from the Auto MPG dataset where the mpg is two standard deviations or more above or below the mean.

In [8]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

# fill missing horsepower values with the median
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)

# drop the name column
df.drop(columns="name", inplace=True)

# drop the outliers in mpg
print(f"Length before MPG outliers dropped: {len(df)}")
remove_outliers(df, "mpg", 2)
print(f"Length after MPG outliers dropped: {len(df)}")

Length before MPG outliers dropped: 398
Length after MPG outliers dropped: 388
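
The function above modifies the dataframe in place. A variant that returns a filtered copy and leaves the input untouched is sketched below; `without_outliers` and the sample values are hypothetical names and data chosen for illustration.

```python
import pandas as pd

def without_outliers(df, name, sd):
    """Return a copy of df keeping rows within sd standard deviations of the mean."""
    distance = (df[name] - df[name].mean()).abs()
    return df[distance < sd * df[name].std()].copy()

# Illustrative values: nine typical readings and one extreme one
df_demo = pd.DataFrame({"mpg": [20, 21, 19, 22, 20, 21, 19, 20, 22, 90]})
cleaned = without_outliers(df_demo, "mpg", 2)

print(f"Length before: {len(df_demo)}")
print(f"Length after: {len(cleaned)}")
```

Note that the extreme value itself inflates both the mean and the standard deviation, so a threshold of a few standard deviations only catches values that are far outside the bulk of the data.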


### Dropping Fields
Drop fields that are of no value for the neural network's training.

In [10]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

print(f"Before drop: {list(df.columns)}")
df.drop("name", axis=1, inplace=True)
print(f"After drop: {list(df.columns)}")

Before drop: ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
After drop: ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']


### Concatenating Rows and Columns
Pandas can concatenate rows and columns together to form new data frames. This code creates a new data frame from the name and horsepower columns of the Auto MPG dataset.

In [11]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

col_horsepower=df['horsepower']
col_name = df['name']
result = pd.concat([col_name, col_horsepower], axis=1)

result.head()

Unnamed: 0,name,horsepower
0,chevrolet chevelle malibu,130.0
1,buick skylark 320,165.0
2,plymouth satellite,150.0
3,amc rebel sst,150.0
4,ford torino,140.0


The concat function can also concatenate rows together. This code concatenates the first two rows and the last two rows of the Auto MPG dataset.

In [12]:
# create a new dataframe from first 2 rows and last 2 rows
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

result = pd.concat([df[0:2], df[-2:]], axis=0)
result.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
396,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger
397,31.0,4,119.0,82.0,2720,19.4,82,1,chevy s-10


### Training and Validation
We must evaluate a machine learning model based on its ability to predict values that it has never seen before. Because of this, we often divide the training data into a validation and training set. The machine learning model will learn from the training data but ultimately be evaluated based on the validation data.

The code below splits the MPG data into a training and validation set. The training set uses 80% of the data, and the validation set uses 20%. 

In [3]:
df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", na_values=["NA", "?"])

# usually a good idea to shuffle
df = df.reindex(np.random.permutation(df.index))

mask = np.random.rand(len(df)) < 0.8
train_df = pd.DataFrame(df[mask])
validation_df = pd.DataFrame(df[~mask])

print(f'Training DF: {len(train_df)}')
print(f'Validation DF: {len(validation_df)}')

Training DF: 330
Validation DF: 68
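
The random mask assigns each row independently, so the split is only approximately 80/20 and changes on every run. If an exact and repeatable split is wanted, one option is `DataFrame.sample` with a fixed `random_state`. The sketch below assumes a hypothetical 100-row frame standing in for the real data.

```python
import numpy as np
import pandas as pd

# Illustrative frame with 100 rows standing in for the real dataset
df_demo = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# Sample 80% of the rows without replacement; random_state makes it repeatable
train_df = df_demo.sample(frac=0.8, random_state=42)

# Everything that was not sampled becomes the validation set
validation_df = df_demo.drop(train_df.index)

print(f"Training DF: {len(train_df)}")
print(f"Validation DF: {len(validation_df)}")
```

Because `sample` already draws rows in random order, a separate shuffle step is not needed for the training set in this variant.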


### Converting a Dataframe to a Matrix
The dataframe's values property can be used to convert the numerical data to a NumPy matrix.

In [4]:
df[
    [
        "mpg",
        "cylinders",
        "horsepower",
        "weight",
        "acceleration",
        "year",
        "origin"
    ]
].values

array([[ 15. ,   8. , 145. , ...,  13. ,  73. ,   1. ],
       [ 16. ,   6. , 100. , ...,  18. ,  73. ,   1. ],
       [ 30.5,   4. ,  63. , ...,  17. ,  77. ,   1. ],
       ...,
       [ 31. ,   4. ,  68. , ...,  17.6,  82. ,   3. ],
       [ 14. ,   8. , 140. , ...,  16. ,  74. ,   1. ],
       [ 24.5,   4. ,  60. , ...,  22.1,  76. ,   1. ]])
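
The pandas documentation recommends `to_numpy()` over the `values` property for new code. A minimal sketch on a hypothetical two-row frame is shown below; the `dtype` argument is optional and is used here only to show how a specific numeric type can be requested.

```python
import pandas as pd

# Illustrative frame; the name column is left out of the conversion
df_demo = pd.DataFrame({"mpg": [18.0, 15.0], "cylinders": [8, 8], "name": ["a", "b"]})

# Select the numeric columns and convert them to a float32 NumPy array
matrix = df_demo[["mpg", "cylinders"]].to_numpy(dtype="float32")
print(matrix.shape, matrix.dtype)
```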

### Managing categorical and continuous values
Neural networks require their input to be a fixed number of columns. This input format is very similar to spreadsheet data; it must be entirely numeric. It is essential to represent the data so that the neural network can train from it. The four basic types of data are:

- Character Data (strings)
  - Nominal: Individual discrete items, no order. For example, color, zip code, and shape.
  - Ordinal: Individual distinct items with an implied order. For example, grade level, job title, Starbucks(tm) coffee size (tall, grande, venti).
- Numeric Data
  - Interval: Numeric values, no defined start. For example, temperature.
  - Ratio: Numeric values, clearly defined start. For example, speed.

### Encoding continuous Values
One common transformation is to normalize the inputs. It is sometimes valuable to put numeric inputs into a standard form so that the program can easily compare values measured on different scales. Percentages play this role in simple mathematical operations; in machine learning, the z-score is a common way to normalize continuous data. It is defined as:

$$
z = \frac{x - \mu}{\sigma}
$$

where $\mu$ is the mean, $\mu = \frac{1}{N} \sum_{i=1}^N x_i$,

and $\sigma$ is the standard deviation, $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}$.

The following code replaces the mpg with a z-score. Cars with average MPG will be near zero; values above zero are above average, and values below zero are below average. Z-scores more than 3 above or below zero are very rare; these are outliers.

In [6]:
from scipy.stats import zscore

In [7]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

df["mpg"] = zscore(df["mpg"])
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,-0.706439,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,-1.090751,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,-0.706439,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,-0.962647,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,-0.834543,8,302.0,140.0,3449,10.5,70,1,ford torino
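
The z-score can also be computed directly from the formula with pandas. The sketch below uses a few illustrative mpg values; note that `scipy.stats.zscore` divides by N by default (`ddof=0`), whereas the pandas `std` method defaults to N-1, so `ddof=0` is passed explicitly to match the definition above.

```python
import pandas as pd

mpg = pd.Series([18.0, 15.0, 18.0, 16.0, 17.0], name="mpg")  # illustrative values

# ddof=0 selects the population standard deviation (divide by N)
z = (mpg - mpg.mean()) / mpg.std(ddof=0)
print(z.round(4).tolist())
```

After the transformation the column has a mean of zero and a population standard deviation of one.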


### Encoding categorical values as Dummies
The traditional means of encoding categorical values is to make them dummy variables. This technique is also called `one-hot encoding`. Consider the following data set. The area column is not numeric, so you must encode it with one-hot encoding. We first display the data, then the number of categorical classes in the 'area' column and their individual values:

In [2]:
df = pd.read_csv("data/jh-simple-dataset.csv", na_values=["NA", "?"])
df.head()

Unnamed: 0,id,job,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product
0,1,vv,c,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b
1,2,kd,c,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c
2,3,pe,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b
3,4,11,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b
4,5,kl,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a


In [7]:
areas = list(df["area"].unique())
areas_sorted = sorted(areas)
print(f"Number of areas: {len(areas_sorted)}")
print(f"Areas: {areas_sorted}")

Number of areas: 4
Areas: ['a', 'b', 'c', 'd']


There are four unique values in the area column. To encode these as dummy variables, we would use four columns, each representing one of the areas. For each row, one column would have a value of one, the rest zeros. For this reason, this type of encoding is sometimes called one-hot encoding. The following code shows how you might encode the values "a" through "d". The value "a" becomes [1,0,0,0], the value "b" becomes [0,1,0,0], and so on.

In [8]:
dummies = pd.get_dummies(areas_sorted, prefix="area", dtype=int)
print(dummies)

   area_a  area_b  area_c  area_d
0       1       0       0       0
1       0       1       0       0
2       0       0       1       0
3       0       0       0       1


In [9]:
# encode the categorical column of the dataframe named 'area'
dummies = pd.get_dummies(df["area"], prefix="area", dtype=int)
dummies.head()

Unnamed: 0,area_a,area_b,area_c,area_d
0,0,0,1,0
1,0,0,1,0
2,0,0,1,0
3,0,0,1,0
4,0,0,0,1


In [10]:
# merge the dummies into the original dataframe
df = pd.concat([df, dummies], axis=1)
df.head()

Unnamed: 0,id,job,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product,area_a,area_b,area_c,area_d
0,1,vv,c,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b,0,0,1,0
1,2,kd,c,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c,0,0,1,0
2,3,pe,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b,0,0,1,0
3,4,11,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b,0,0,1,0
4,5,kl,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a,0,0,0,1


In [11]:
# remove the original column 'area' because the goal is to get the data frame to be entirely numeric for the training
df.drop("area", axis=1, inplace=True)
df.head()

Unnamed: 0,id,job,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product,area_a,area_b,area_c,area_d
0,1,vv,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b,0,0,1,0
1,2,kd,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c,0,0,1,0
2,3,pe,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b,0,0,1,0
3,4,11,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b,0,0,1,0
4,5,kl,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a,0,0,0,1


### Encoding categorical values removing the first level of one-hot encoding
The pd.get_dummies function also includes a parameter named drop_first, which specifies whether to get k-1 dummies out of k categorical levels by removing the first level. Why would you want to remove the first level, in this case, area_a? This technique provides a more efficient encoding by using the ordinarily unused encoding [0,0,0]. We encode the area to just three columns and map the categorical value of a to [0,0,0]. The following code demonstrates this technique.

In [13]:
df = pd.read_csv("data/jh-simple-dataset.csv", na_values=["NA", "?"])

# encode the area column as dummy variables dropping the first one-hot encoding category column
dummies = pd.get_dummies(df["area"], drop_first=True, prefix="area", dtype=int)
df = pd.concat([df, dummies], axis=1)
df.drop("area", axis=1, inplace=True)
df.head()

Unnamed: 0,id,job,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product,area_b,area_c,area_d
0,1,vv,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b,0,1,0
1,2,kd,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c,0,1,0
2,3,pe,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b,0,1,0
3,4,11,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b,0,1,0
4,5,kl,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a,0,0,1
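
To see the effect of drop_first in isolation, the sketch below encodes a short hypothetical series of areas both ways. The rows holding the first level ("a") end up as all zeros in the reduced encoding.

```python
import pandas as pd

areas = pd.Series(["a", "b", "c", "d", "a"], name="area")  # illustrative values

# Full one-hot encoding: one column per level
full = pd.get_dummies(areas, prefix="area", dtype=int)

# Reduced encoding: the first level (area_a) is dropped
reduced = pd.get_dummies(areas, prefix="area", dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```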


### Target encoding for categoricals (not recommended)
Target encoding can sometimes increase the predictive power of a machine learning model. However, it also dramatically increases the risk of overfitting. This method must be used with care.

Generally, target encoding can only be used on a categorical feature when the output of the machine learning model is numeric (regression).

The concept of target encoding is straightforward. For each category, we calculate the average target value for that category. Then, to encode, we substitute each categorical value with the average corresponding to its category. Unlike dummy variables, where you have a column for each category, target encoding needs only a single column. In this way, target encoding is more efficient than dummy variables.

In [3]:
np.random.seed(43)
df = pd.DataFrame(
    {
        "cont_9": np.random.rand(10)*100,
        "cat_0": ["dog"]*5 + ["cat"]*5,
        "cat_1": ["wolf"]*9 + ["tiger"]*1,
        "y": [1,0,1,1,1,1,0,0,0,0]
    }
)

df.head()

Unnamed: 0,cont_9,cat_0,cat_1,y
0,11.505457,dog,wolf,1
1,60.906654,dog,wolf,0
2,13.339096,dog,wolf,1
3,24.058962,dog,wolf,1
4,32.713906,dog,wolf,1


Rather than creating dummy variables for "dog" and "cat", we would like to replace the categories with a number. We could use 0 for a cat and 1 for a dog. However, we can encode more information than just that, and a simple 0 or 1 would only work for a single pair of animals. Consider what the mean target value is for cat and dog.

In [4]:
means_0 = df.groupby("cat_0")["y"].mean().to_dict()
means_0

{'cat': 0.2, 'dog': 0.8}

The danger is that we are now using the target value (y) for training. This technique will potentially lead to overfitting. The possibility of overfitting is even greater if there are only a small number of rows in a particular category. To prevent this from happening, we use a weighting factor. The stronger the weight, the more categories with fewer values will tend towards the overall average of y. You can perform this calculation as follows.

In [5]:
def calc_smooth_mean(df1, df2, cat_name, target, weight):
    # compute the global mean from df1 (the frame passed in, not a global df)
    mean = df1[target].mean()

    # compute the number of values and the mean of each group
    agg = df1.groupby(cat_name)[target].agg(["count", "mean"])
    counts = agg["count"]
    means = agg["mean"]
    
    # compute the "smoothed" means
    smooth = (counts*means+weight*mean)/(counts+weight)
    
    # replace each value by the according smoothed mean
    if df2 is None:
        return df1[cat_name].map(smooth)
    else:
        return df1[cat_name].map(smooth), df2[cat_name].map(smooth.to_dict())

Encoding example

In [6]:
weight = 5
df["cat_0_enc"] = calc_smooth_mean(df1=df, df2=None, cat_name="cat_0", target="y", weight=weight)
df["cat_1_enc"] = calc_smooth_mean(df1=df, df2=None, cat_name="cat_1", target="y", weight=weight)
df.head()

Unnamed: 0,cont_9,cat_0,cat_1,y,cat_0_enc,cat_1_enc
0,11.505457,dog,wolf,1,0.65,0.535714
1,60.906654,dog,wolf,0,0.65,0.535714
2,13.339096,dog,wolf,1,0.65,0.535714
3,24.058962,dog,wolf,1,0.65,0.535714
4,32.713906,dog,wolf,1,0.65,0.535714
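
The value 0.65 for "dog" follows from the smoothing formula: five dogs with a mean of 0.8, a global mean of 0.5, and a weight of 5 give (5 × 0.8 + 5 × 0.5) / (5 + 5) = 0.65. The sketch below reproduces this and also shows one way to limit leakage: compute the statistics on the training rows only and apply them to a separate validation frame. The helper `smooth_means`, the `valid` frame, and the fallback to the global mean for unseen categories are assumptions made for illustration.

```python
import pandas as pd

def smooth_means(train, cat_name, target, weight):
    """Smoothed per-category target means computed from the training rows only."""
    global_mean = train[target].mean()
    agg = train.groupby(cat_name)[target].agg(["count", "mean"])
    return (agg["count"] * agg["mean"] + weight * global_mean) / (agg["count"] + weight)

# Hypothetical split: statistics come from train and are then applied to valid
train = pd.DataFrame({
    "cat_0": ["dog"] * 5 + ["cat"] * 5,
    "y": [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
})
valid = pd.DataFrame({"cat_0": ["dog", "cat", "bird"]})

smooth = smooth_means(train, "cat_0", "y", weight=5)
global_mean = train["y"].mean()

# Categories never seen in training ("bird") fall back to the global mean
valid["cat_0_enc"] = valid["cat_0"].map(smooth).fillna(global_mean)
print(valid)
```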


### Shuffling the dataset
There may be information lurking in the order of the rows of your dataset. Unless you are dealing with time-series data, the order of the rows should not be significant. Consider a training set that includes the employees of a company. Perhaps this dataset is ordered by the number of years the employees have been with the company. It is okay to have an individual column that specifies years of service. However, having the data in this order might be problematic.
Consider splitting the data into training and validation sets. You could end up with your validation set having only the newer employees and the training set only the longer-term employees. Separating the data into k-fold cross-validation folds could have similar problems. Because of these issues, it is important to shuffle the data set.
Shuffling and reindexing are often performed together. Shuffling randomizes the order of the data set. However, it does not change the pandas row numbers. The following code demonstrates a reshuffle. Notice that the program has not reset the row indexes in the first column. Generally this will not cause any issues, and it allows tracing back to the original order of the data. However, it is recommended to reset this index.

In [2]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

# np.random.permutation returns the dataframe's index in random order;
# passing that array to df.reindex shuffles all the rows of the dataframe
shuffled_index = np.random.permutation(df.index)
df = df.reindex(shuffled_index)
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
137,13.0,8,350.0,150.0,4699,14.5,74,1,buick century luxus (sw)
29,27.0,4,97.0,88.0,2130,14.5,71,3,datsun pl510
45,18.0,6,258.0,110.0,2962,13.5,71,1,amc hornet sportabout (sw)
186,27.0,4,101.0,83.0,2202,15.3,76,2,renault 12tl
314,26.4,4,140.0,88.0,2870,18.1,80,1,ford fairmont


In [3]:
# notice that the new index is altered, to reset the index
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,13.0,8,350.0,150.0,4699,14.5,74,1,buick century luxus (sw)
1,27.0,4,97.0,88.0,2130,14.5,71,3,datsun pl510
2,18.0,6,258.0,110.0,2962,13.5,71,1,amc hornet sportabout (sw)
3,27.0,4,101.0,83.0,2202,15.3,76,2,renault 12tl
4,26.4,4,140.0,88.0,2870,18.1,80,1,ford fairmont
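
Shuffling and resetting the index can also be combined in a single expression with `DataFrame.sample`. The sketch below uses a hypothetical five-row frame; `random_state` is set only so that the result is repeatable.

```python
import pandas as pd

# Illustrative frame, not the real dataset
df_demo = pd.DataFrame({"name": ["a", "b", "c", "d", "e"], "mpg": [18, 15, 27, 26, 13]})

# frac=1 samples every row exactly once, which amounts to a shuffle
shuffled = df_demo.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```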


### Sorting a Data Set
Sorting the dataset allows you to order the rows in either ascending or descending order for one or more columns. The following code sorts the MPG dataset by name and displays the first car.

In [4]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

df = df.sort_values(by="name", ascending=True)
print(f"The first car is: {df['name'].iloc[0]}")

df.head()

The first car is: amc ambassador brougham


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
96,13.0,8,360.0,175.0,3821,11.0,73,1,amc ambassador brougham
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl
66,17.0,8,304.0,150.0,3672,11.5,72,1,amc ambassador sst
315,24.3,4,151.0,90.0,3003,20.1,80,1,amc concord
257,19.4,6,232.0,90.0,3210,17.2,78,1,amc concord


### Grouping a dataset 
Grouping is a typical operation on data sets. Structured Query Language (SQL) calls this operation a `GROUP BY`. Programmers use grouping to summarize data. Because of this, the row count will usually shrink in the summary, and you cannot undo the grouping. Because of this loss of information, it is essential to keep your original data before the grouping.

You can use grouping to perform summaries. For example, the following code groups the cars by cylinders and computes the average (mean) mpg of each group. In addition to mean, you can use other aggregating functions, such as sum or count.

In [5]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])

g = df.groupby("cylinders")["mpg"].mean()
g

cylinders
3    20.550000
4    29.286765
5    27.366667
6    19.985714
8    14.963107
Name: mpg, dtype: float64

It might be useful to have these mean values as a dictionary. A dictionary allows you to access an individual element quickly. For example, you could quickly look up the mean for six-cylinder cars.

In [6]:
d = g.to_dict()
d[6]

19.985714285714284

In [7]:
# redefine g as a dictionary
g = df.groupby("cylinders")["mpg"].count().to_dict()

In [8]:
g

{3: 4, 4: 204, 5: 3, 6: 84, 8: 103}
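
The mean and the count can also be computed in a single grouping pass by giving `agg` a list of function names. The sketch below uses a small hypothetical frame; `to_dict(orient="index")` then yields one nested dictionary per cylinder count.

```python
import pandas as pd

# Illustrative values, not the real dataset
df_demo = pd.DataFrame({
    "cylinders": [4, 4, 6, 8, 8, 8],
    "mpg": [30.0, 28.0, 20.0, 15.0, 14.0, 16.0],
})

# One row per group, one column per aggregate
summary = df_demo.groupby("cylinders")["mpg"].agg(["mean", "count"])
print(summary)

# Nested dictionary keyed by cylinder count
lookup = summary.to_dict(orient="index")
print(lookup[8])
```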

## Mapping and functions for pre-processing
Remembering mapping, comprehension, filtering, lambda functions and reduce in python:

The following function can be used to trim white space from a string and capitalize the first letter.

In [10]:
def process_string(s):
    # the parameter is named s to avoid shadowing the built-in str type
    t = s.strip()
    return t[0].upper() + t[1:] # t[1:] means "every character from index 1 to the end of the string"

Example:

In [11]:
result = process_string(" hello ")
print(f'"{result}"')

"Hello"


The map function takes a list and applies a function to each member of the list and returns a second list that is the same size as the original list.

In [12]:
l = [" apple ", " orange ", " mango ", " strawberry ", " blueberry ", " pine apple "]
mapped_list = list(map(process_string, l))
mapped_list

['Apple', 'Orange', 'Mango', 'Strawberry', 'Blueberry', 'Pine apple']

Similar list processing can be done with a Python list comprehension.

In [13]:
l = [" apple ", " orange ", " mango ", " strawberry ", " blueberry ", " pine apple "]
comp_list = [process_string(x) for x in l]
comp_list

['Apple', 'Orange', 'Mango', 'Strawberry', 'Blueberry', 'Pine apple']

While a map function always creates a new list of the same size as the original, the filter function creates a potentially smaller list

In [14]:
def greater_than_five(x):
    return x > 5

In [15]:
l = [1, 10, 20, 3, -2, 0]
greater_list = list(filter(greater_than_five, l))
greater_list

[10, 20]

It might seem somewhat tedious to have to create an entire function just to check whether a value is greater than 5. A lambda saves you this effort. A lambda is essentially an unnamed function.

In [17]:
l = [1, 10, 20, 3, -2, 0]
lambda_list = list(filter(lambda x: x > 5, l))
lambda_list

[10, 20]

Like filter and map, the reduce function also works on a list. However, the result of reduce is a single value. Consider if you wanted to sum the values of a list.

In [18]:
from functools import reduce

l = [1, 10, 20, 3, -2, 0]
result = reduce(lambda x, y: x + y, l)
print(result)

32
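
Two details of reduce are worth a short illustration: an optional third argument supplies a starting value, which makes the call safe on an empty list, and the combining function does not have to be a sum. The sketch below is a minimal example; the variable names are chosen for illustration.

```python
from functools import reduce

values = [1, 10, 20, 3, -2, 0]

# The third argument is the starting value, so an empty list returns 0 instead of raising
total = reduce(lambda acc, x: acc + x, values, 0)
empty_total = reduce(lambda acc, x: acc + x, [], 0)

# reduce is not limited to sums: here it keeps the largest value seen so far
largest = reduce(lambda acc, x: acc if acc > x else x, values)

print(total, empty_total, largest)
```

For plain sums and maxima, the built-in `sum` and `max` functions are the more idiomatic choice; reduce is mainly useful when no built-in covers the combining rule.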


If you've ever worked with Big Data or functional programming languages before, you've likely heard of map/reduce. Map and reduce are two functional programming techniques that allow you to apply a function across an entire data frame. In addition to functions that you write, Pandas also provides several standard functions for use with dataframes.

### Using map with dataframes
The map function allows you to transform a column by mapping certain values in that column to other values. Consider the Auto MPG dataset, which contains a field origin that holds a value between one and three indicating the geographic origin of each car. We can use the map function to transform this numeric origin into the textual name of each origin.

In [20]:
df = pd.read_csv("data/auto-mpg.csv", na_values=["NA", "?"])
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


The map method in Pandas operates on a single column. You provide map with a dictionary of values to transform the target column. The dictionary keys specify which values in the target column should be replaced, and each key's associated value is the replacement. The following code shows how the map function can transform the numeric values of 1, 2, and 3 into the string values of North America, Europe, and Asia.

In [21]:
origin_dict = {1: "North America", 2: "Europe", 3: "Asia"}
df["origin_name"] = df["origin"].map(origin_dict)
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,origin_name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,North America
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320,North America
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,North America
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,North America
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino,North America
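
One behavior to keep in mind is that values missing from the dictionary are mapped to NaN rather than left unchanged. The sketch below illustrates this with a hypothetical series that contains an unknown code (4); the "Unknown" label is an illustrative choice.

```python
import pandas as pd

origin = pd.Series([1, 2, 3, 4], name="origin")  # 4 is a deliberately unknown code
origin_dict = {1: "North America", 2: "Europe", 3: "Asia"}

# Codes without a dictionary entry become NaN
mapped = origin.map(origin_dict)
print(mapped.isna().sum())

# Replace the unmapped entries with an explicit label
mapped_filled = mapped.fillna("Unknown")
print(mapped_filled.tolist())
```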


### Using df.apply
The apply function of the dataframe can run a function over the entire dataframe. You can use either a traditional named function or a lambda function. Pandas will execute the provided function against each of the rows or columns in the dataframe. The axis parameter specifies whether the function is run across rows or columns. With axis = 1, the function receives one row at a time. The following code calculates a series called efficiency that is the displacement divided by horsepower.

In [22]:
efficiency = df.apply(lambda x: x["displacement"]/x["horsepower"], axis=1)
df["efficiency"] = efficiency
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,origin_name,efficiency
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,North America,2.361538
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320,North America,2.121212
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,North America,2.12
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,North America,2.026667
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino,North America,2.157143
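
For simple arithmetic between columns, the same result can be obtained without apply by operating on the columns directly, which is usually faster on large frames. The sketch below compares both approaches on three rows whose displacement and horsepower values are copied from the table above.

```python
import pandas as pd

# Displacement and horsepower of the first three cars shown above
df_demo = pd.DataFrame({
    "displacement": [307.0, 350.0, 318.0],
    "horsepower": [130.0, 165.0, 150.0],
})

# Row-by-row with apply
by_apply = df_demo.apply(lambda row: row["displacement"] / row["horsepower"], axis=1)

# Column arithmetic, evaluated for all rows at once
vectorized = df_demo["displacement"] / df_demo["horsepower"]

print(vectorized.round(6).tolist())
```

apply with axis=1 remains useful when the per-row logic cannot be expressed as column operations.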


### Feature engineering with Pandas

In this section, we will see how to calculate a complex feature using map, apply, and grouping. The dataset is the following CSV:
- https://www.irs.gov/pub/irs-soi/16zpallagi.csv

US Government public data, "SOI Tax Stats - Individual Tax Statistics", metadata:
- https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2016-zip-code-data-soi

For this feature, we will attempt to estimate the adjusted gross income (AGI) for each of the zip codes. The data file contains many columns; however, you will only use the following:

- STATE: The state (e.g., MO)
- zipcode: The zip code (e.g., 63017)
- agi_stub: Six different brackets of annual income (1 to 6)
- N1: Number of tax returns for each of the agi_stubs

Note that the file will have six rows for each zip code, one for each of the agi_stub brackets. You can skip zip codes with 0 or 99999.

We will create an output with these columns but only one row per zip code, calculated as a weighted average of the income brackets. The code below loads the file and then shows the six rows present for zip code 63017:

In [4]:
df = pd.read_csv("data/16zpallagi.csv")
df.head()

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,...,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,1,AL,0,1,815440,477700,105350,221200,440830,1296920,...,367320,330066,0,0,0,0,63420,51444,711580,1831661
1,1,AL,0,2,495830,211930,142340,128890,272440,996240,...,398050,984297,0,0,0,0,74090,110889,416090,1173463
2,1,AL,0,3,263390,83420,137870,36340,154880,584000,...,253180,1349246,0,0,0,0,64000,143060,195130,543284
3,1,AL,0,4,167190,29420,124060,10610,99700,421720,...,165830,1425430,0,0,0,0,45020,128920,117410,381329
4,1,AL,0,5,217440,20240,188080,4880,129410,601040,...,216720,3922449,390,155,60,19,82940,423629,126130,506526


In [7]:
# creating df subset where zipcode = 63017
filter_df = df[df['zipcode']==63017]
filter_df

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,...,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
85118,29,MO,63017,1,4710,4020,420,230,2140,3830,...,1950,1688,0,0,0,0,550,411,3430,3769
85119,29,MO,63017,2,2780,1740,700,270,1360,4400,...,2400,6833,0,0,0,0,570,894,2040,4481
85120,29,MO,63017,3,2130,1110,770,170,1240,3720,...,2030,12046,0,0,0,0,650,1560,1320,3654
85121,29,MO,63017,4,2010,670,1170,140,1130,4270,...,1970,17886,0,0,0,0,630,1806,1220,4479
85122,29,MO,63017,5,5240,780,4230,180,2940,13720,...,5200,98240,40,16,0,0,1910,8883,2850,13036
85123,29,MO,63017,6,3510,290,3130,70,2400,10340,...,3500,406897,1870,3523,2060,5636,1660,36115,1290,14257


We must combine these six rows into one. For privacy reasons, AGIs are broken out into 6 buckets. We need to combine the buckets and estimate the actual AGI of a zip code. To do this, consider the values for agi_stub:

- 1 = 1 to 25,000
- 2 = 25,000 to 50,000
- 3 = 50,000 to 75,000
- 4 = 75,000 to 100,000
- 5 = 100,000 to 200,000
- 6 = 200,000 or more

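The representative value used for each of these ranges is listed below. For brackets 1 to 4 it is the midpoint of the range. For brackets 5 and 6 it is the lower bound plus 12,500 rather than a true midpoint (the midpoint of bracket 5 would be 150,000, and bracket 6 has no upper bound).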

- 1 = 12,500
- 2 = 37,500
- 3 = 62,500
- 4 = 87,500
- 5 = 112,500
- 6 = 212,500

To calculate the average AGI per zip code, we first trim all zip codes that are either 0 or 99999 and select the four fields that we need: STATE, zipcode, agi_stub, and N1.

In [10]:
df = df.loc[(df["zipcode"] != 0) & (df["zipcode"] != 99999), ["STATE","zipcode","agi_stub", "N1"]]
df.tail()

Unnamed: 0,STATE,zipcode,agi_stub,N1
179785,WY,83414,2,40
179786,WY,83414,3,40
179787,WY,83414,4,0
179788,WY,83414,5,40
179789,WY,83414,6,30


We replace each agi_stub value with its representative income value using the map function.

In [11]:
median_dict = {1: 12500, 2: 37500, 3: 62500, 4: 87500, 5: 112500, 6: 212500}
df["agi_stub"] = df["agi_stub"].map(median_dict)
df.tail()

Unnamed: 0,STATE,zipcode,agi_stub,N1
179785,WY,83414,37500,40
179786,WY,83414,62500,40
179787,WY,83414,87500,0
179788,WY,83414,112500,40
179789,WY,83414,212500,30


Grouping the dataframe by zipcode

In [12]:
groups = df.groupby(by="zipcode")

The following code applies a lambda across the groups and calculates the AGI estimate

In [15]:
df = pd.DataFrame(groups.apply(lambda x: sum(x["N1"]*x["agi_stub"])/sum(x["N1"]), include_groups=False)).reset_index()
df.tail()

Unnamed: 0,zipcode,0
29867,99921,48042.168675
29868,99922,32954.545455
29869,99925,45639.534884
29870,99926,41136.363636
29871,99929,45911.214953


In [16]:
# rename the columns
df.columns = ["zipcode", "agi_estimate"]

In [17]:
# check to see the AGI for the zipcode 63017
agi_63017 = df[df["zipcode"]==63017]
agi_63017

Unnamed: 0,zipcode,agi_estimate
19909,63017,88689.892051
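
The estimate for zip code 63017 can be checked by hand: it is the sum of N1 × bracket value over the six rows, divided by the sum of N1. The sketch below reproduces this with the N1 values copied from the filtered rows shown earlier and computes the weighted average with column arithmetic and a grouped sum instead of a lambda. The column names `agi_value` and `weighted` are hypothetical.

```python
import pandas as pd

# N1 values for zip code 63017, copied from the filtered rows shown earlier
rows = pd.DataFrame({
    "zipcode": [63017] * 6,
    "agi_stub": [1, 2, 3, 4, 5, 6],
    "N1": [4710, 2780, 2130, 2010, 5240, 3510],
})
median_dict = {1: 12500, 2: 37500, 3: 62500, 4: 87500, 5: 112500, 6: 212500}

# Weight each bracket value by its number of returns
rows["agi_value"] = rows["agi_stub"].map(median_dict)
rows["weighted"] = rows["N1"] * rows["agi_value"]

# Sum numerator and denominator per zip code, then divide
totals = rows.groupby("zipcode")[["weighted", "N1"]].sum()
agi_estimate = totals["weighted"] / totals["N1"]
print(agi_estimate)
```

Here the numerator is 1,807,500,000 and the denominator is 20,380 returns, which gives the 88,689.89 shown above.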


Feature engineering is an essential part of machine learning. For now, we will manually engineer features. However, later in this course, we will see some techniques for automatic feature engineering.



### Calculated Fields
It is possible to add new fields to the data frame that your program calculates from the other fields. We can create a new column that gives the weight in kilograms. The equation to convert a weight given in pounds to kilograms is:
$$
m_{(kg)} = m_{(lb)} \times 0.45359237
$$
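
A minimal sketch of such a calculated field follows. It assumes that the weight column is in pounds, uses two weights taken from the Auto MPG rows shown earlier, and names the new column `weight_kg`, which is a hypothetical choice.

```python
import pandas as pd

LB_TO_KG = 0.45359237  # conversion factor from the equation above

# Two cars from the Auto MPG data; weight is assumed to be in pounds
df_demo = pd.DataFrame({
    "name": ["chevrolet chevelle malibu", "buick skylark 320"],
    "weight": [3504, 3693],
})

# New column computed from an existing one
df_demo["weight_kg"] = df_demo["weight"] * LB_TO_KG
print(df_demo)
```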

