
In [0]:
# You can use the following two lines to check the python version
# import sys
# print(sys.version)

# Import packages
import pandas as pd
import numpy as np
from scipy import stats 

# Read and write data

You can read data with `read_csv` from `pandas`.

In [0]:
# Read the data
SimDat = pd.read_csv("http://bit.ly/2P5gTw4")

In [3]:
## Check the head of the data
SimDat.head()

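|  | age | gender | income | house | store_exp | online_exp | store_trans | online_trans | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | segment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 57 | Female | 120963.400958 | Yes | 529.134363 | 303.512475 | 2 | 2 | 4 | 2 | 1 | 2 | 1 | 4 | 1 | 4 | 2 | 4 | Price |
| 1 | 63 | Female | 122008.10495 | Yes | 478.005781 | 109.52971 | 4 | 2 | 4 | 1 | 1 | 2 | 1 | 4 | 1 | 4 | 1 | 4 | Price |
| 2 | 59 | Male | 114202.295294 | Yes | 490.810731 | 279.249582 | 7 | 2 | 5 | 2 | 1 | 2 | 1 | 4 | 1 | 4 | 1 | 4 | Price |
| 3 | 60 | Male | 113616.337078 | Yes | 347.809004 | 141.669752 | 10 | 2 | 5 | 2 | 1 | 3 | 1 | 4 | 1 | 4 | 2 | 4 | Price |
| 4 | 51 | Male | 124252.552787 | Yes | 379.62594 | 112.237177 | 4 | 4 | 4 | 1 | 1 | 3 | 1 | 4 | 1 | 4 | 2 | 4 | Price |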
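
The section title also mentions writing data. As a minimal sketch, assuming a small made-up data frame `toy` (not part of the original notebook), `to_csv` writes a data frame and `read_csv` reads it back. An in-memory buffer stands in for a file path so the example runs anywhere.

```python
import io
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [57, 63, 59],
    "gender": ["Female", "Female", "Male"],
    "store_exp": [529.13, 478.01, 490.81],
})

# Write to an in-memory text buffer; a file path such as "toy.csv" works the same way.
# index=False leaves the row index out of the file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a data frame
buffer.seek(0)
toy_back = pd.read_csv(buffer)
print(toy_back)
```

If the index is written (the default, `index=True`), reading the file back produces an extra column named `Unnamed: 0`.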


# `apply` function

pandas data frames have an `apply` method that applies a function along one axis of the data frame. Let’s use some data with context to make this concrete: get the mean and standard deviation of all numerical variables in the dataset `SimDat`.

In [0]:
# Select numerical variables (i.e. exclude object type)
SubDat = SimDat.select_dtypes(include = ['int64', 'float64'])
# Or select every numeric dtype at once
# SubDat = SimDat.select_dtypes(include = 'number')
# Or exclude object type
# SubDat = SimDat.select_dtypes(exclude = ['object'])

The data frame `SubDat` only includes numeric columns. Now we can go ahead and use the `apply` function to get the mean and standard deviation for each column:

In [5]:
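# axis = 0 or 'index': apply function to each column.
# axis = 1 or 'columns': apply function to each row.
# Get the mean (not displayed: a notebook cell only shows its last expression)
SubDat.apply(np.mean, axis = 0)
# Get the standard deviation
# np.std defaults to ddof = 0 (population formula), so the values below
# are slightly smaller than the sample std reported by describe()
SubDat.apply(np.std, axis = 0)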

age                16.408607
income          49811.737217
store_exp        2773.012238
online_exp       1730.358479
store_trans         3.693711
online_trans        7.952980
Q1                  1.449413
Q2                  1.167763
Q3                  1.401405
Q4                  1.154483
Q5                  1.283735
Q6                  1.437809
Q7                  1.455213
Q8                  1.153769
Q9                  1.117933
Q10                 1.135606
dtype: float64
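
For these two statistics, pandas also has built-in column-wise methods. The sketch below, on a made-up data frame `toy`, computes both at once with `agg`. It also shows how the `ddof` argument accounts for the small difference between `np.std` (population formula, `ddof = 0`) and the pandas default (sample formula, `ddof = 1`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 20.0, 30.0, 40.0]})

# Mean and sample standard deviation (ddof = 1) of every column in one call
summary = toy.agg(["mean", "std"])
print(summary)

# Population standard deviation (ddof = 0), matching np.std on a plain array
pop_std = toy.std(ddof=0)
print(pop_std)
print(np.std(toy["a"].to_numpy()))

# axis = 1 applies the function to each row instead of each column
row_range = toy.apply(lambda row: row.max() - row.min(), axis=1)
print(row_range)
```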

# Tidy and Reshape Data

We will illustrate the data manipulations in order:

- Display
- Subset
- Summarize
- Create new variable
- Merge
- Reshape data

## Display

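The `describe` method generates descriptive statistics that summarize the central tendency, dispersion and shape of each numeric column's distribution, excluding NaN values. In the output below, `income` has a count of 816 out of 1000 rows, so it contains missing values, while `age` (maximum 300) and `store_exp` (minimum -500) contain implausible values.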

In [6]:
SimDat.describe()

|  | age | income | store_exp | online_exp | store_trans | online_trans | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1000.0 | 816.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 38.84 | 113543.065222 | 1356.850523 | 2120.181187 | 5.35 | 13.546 | 3.101 | 1.823 | 1.992 | 2.763 | 2.945 | 2.448 | 3.434 | 2.396 | 3.085 | 2.32 |
| std | 16.416818 | 49842.287197 | 2774.399785 | 1731.224308 | 3.695559 | 7.956959 | 1.450139 | 1.168348 | 1.402106 | 1.155061 | 1.284377 | 1.438529 | 1.455941 | 1.154347 | 1.118493 | 1.136174 |
| min | 16.0 | 41775.637023 | -500.0 | 68.817228 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 25.0 | 85832.393634 | 204.976456 | 420.341127 | 3.0 | 6.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.75 | 1.0 | 2.5 | 1.0 | 2.0 | 1.0 |
| 50% | 36.0 | 93868.682835 | 328.980863 | 1941.855436 | 4.0 | 14.0 | 3.0 | 1.0 | 1.0 | 3.0 | 4.0 | 2.0 | 4.0 | 2.0 | 4.0 | 2.0 |
| 75% | 53.0 | 124572.400926 | 597.293077 | 2440.774823 | 7.0 | 20.0 | 4.0 | 2.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 4.0 | 3.0 |
| max | 300.0 | 319704.337941 | 50000.0 | 9479.44231 | 20.0 | 36.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
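
By default `describe` summarizes only the numeric columns, which is why `gender`, `house` and `segment` are absent from the output above. A minimal sketch on a made-up data frame `toy`: passing `include = 'all'` adds the count, number of unique values, most frequent value and its frequency for the non-numeric columns.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [23, 35, 47, 61],
    "gender": ["Female", "Female", "Male", "Female"],
})

# Rows such as 'unique', 'top' and 'freq' apply to the non-numeric column;
# rows such as 'mean' and 'std' apply to the numeric column. Other cells are NaN.
summary_all = toy.describe(include="all")
print(summary_all)
```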


## Subset


### Subset rows

In [0]:
# Select rows that meet logical criteria. For example, get rows with income more than 300000
SimDat[SimDat.income > 300000]

# Select rows by position
SimDat.iloc[5:10]

# Select the first n rows
SimDat.head(3)

# select the last n rows
SimDat.tail(3)

# Randomly select fraction
SimDat.sample(frac = 0.5)

# or use n = 10 to randomly select n rows
SimDat.sample(n = 10)

# Delete duplicated rows.
SimDat.drop_duplicates()
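
Logical criteria can also be combined. As a sketch on a made-up data frame `toy`: wrap each condition in parentheses and join them with `&` (and) or `|` (or). `query` expresses the same filter as a string.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [23, 35, 47, 61, 52],
    "gender": ["Female", "Male", "Male", "Female", "Male"],
    "income": [81000.0, 73000.0, 140000.0, None, 310000.0],
})

# Both conditions must hold; the parentheses are needed because & binds tighter than >
older_men = toy[(toy.age > 40) & (toy.gender == "Male")]

# Same filter written as a query string
older_men_q = toy.query("age > 40 and gender == 'Male'")

# Rows with a missing income
no_income = toy[toy.income.isna()]

print(older_men)
print(no_income)
```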

### Subset columns



In [0]:
# Select multiple columns with specific names.
SimDat[['age','gender','income']]

# select Q1 to Q5
# create a list of names
nam_list = [f'Q{i}' for i in range(1, 6)]
SimDat[nam_list]

# select one column
SimDat.gender
# or
SimDat['gender']

# select columns whose name contains a character string
# match column name contains "_"
SimDat.filter(regex = '_')

# select columns whose name starts with a character string
SimDat.filter(regex = '^Q')
# select columns whose name ends with a character string
SimDat.filter(regex = 'e$')

# select columns between age and online_exp
SimDat.loc[:, 'age':'online_exp']

# select all columns except for age
SimDat.drop(columns = ['age'])
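
To see what each selector picks up, the sketch below runs the same calls on a one-row data frame `toy` whose column names are a subset of those in `SimDat` (the values are placeholders).

```python
import pandas as pd

# Column names borrowed from SimDat; the single row of zeros is a placeholder
cols = ["age", "gender", "income", "house", "store_exp", "online_exp", "Q1", "Q2", "segment"]
toy = pd.DataFrame([[0] * len(cols)], columns=cols)

contains_underscore = list(toy.filter(regex="_").columns)  # name contains "_"
starts_with_q = list(toy.filter(regex="^Q").columns)       # name starts with "Q"
ends_with_e = list(toy.filter(regex="e$").columns)         # name ends with "e"

# Label slices with .loc include BOTH endpoints
age_to_online = list(toy.loc[:, "age":"online_exp"].columns)

print(contains_underscore)
print(starts_with_q)
print(ends_with_e)
print(age_to_online)
```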

## Summarize

A standard marketing problem is customer segmentation. It usually starts with designing a survey and collecting data, followed by a cluster analysis on the data to get customer segments. Once we have the segments, the next step is to understand what each group of customers looks like by summarizing some key metrics. For example, we can do the following data aggregation for the different segments of clothes customers.

In [60]:
df = (SimDat
      .groupby('segment', as_index=False)
      .agg({'age': lambda x: round(x.mean(), 0),
            'gender': lambda x: round((x =='Female').mean(), 2),
            'house': lambda x: round((x =='Yes').mean(), 2),
            'online_exp': lambda x: round(stats.trim_mean(x, 0.1), 0),
            'store_trans': lambda x: round(x.mean(), 1),
            'online_trans': lambda x: round(x.mean(),1)
            })
      )
df

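|  | segment | age | gender | house | online_exp | store_trans | online_trans |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Conspicuous | 42 | 0.32 | 0.86 | 4891.0 | 10.9 | 11.1 |
| 1 | Price | 60 | 0.45 | 0.94 | 205.0 | 6.1 | 3.0 |
| 2 | Quality | 35 | 0.47 | 0.34 | 2012.0 | 2.9 | 16.0 |
| 3 | Style | 24 | 0.81 | 0.27 | 1958.0 | 3.0 | 21.1 |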


Now, let’s peel the onion.

`SimDat` is the data you want to work on. `groupby('segment', as_index=False)` tells Python that the following steps should summarize by the variable `segment`. By default, aggregated output uses the group labels as the index. Setting `as_index = False` keeps the group labels as a normal column, which is effectively “SQL-style” grouped output.

Here we only summarize data by one categorical variable, but you can group by multiple variables, such as `groupby(['segment','house'], as_index=False)`. 
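
A minimal sketch of both points, on a made-up data frame `toy` that borrows a few column names from `SimDat`: the effect of `as_index`, and grouping by two variables.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "segment": ["Price", "Price", "Style", "Style", "Style"],
    "house": ["Yes", "No", "No", "No", "Yes"],
    "online_trans": [2, 4, 20, 22, 18],
})

# Default: the group labels become the index of the result
by_index = toy.groupby("segment").agg({"online_trans": "mean"})

# as_index=False: the group labels stay as an ordinary column
by_column = toy.groupby("segment", as_index=False).agg({"online_trans": "mean"})

# Two grouping variables: one row per observed (segment, house) combination
by_two = toy.groupby(["segment", "house"], as_index=False).agg({"online_trans": "mean"})

print(by_index, by_column, by_two, sep="\n\n")
```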

`agg` aggregates each group using the given functions. For example, `'age': lambda x: round(x.mean(), 0)` tells Python to do the following:

- Calculate the mean of column `age` for each customer segment, ignoring missing values
- Round the result to the specified number of decimal places (0 here)

The rest of the commands are similar. In the end, we calculate the following for each segment:

- `age`: average age for each segment
- `gender`: proportion of females in each segment
- `house`: proportion of people who own a house
- `online_exp`: 10% trimmed mean of online expense
- `store_trans`: average number of in-store transactions
- `online_trans`: average number of online transactions

There is a lot of information you can extract from these simple summaries.
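
The `online_exp` aggregation uses `stats.trim_mean(x, 0.1)` rather than a plain mean. As a small illustration with made-up numbers, the trimmed mean drops 10% of the observations from each end before averaging, which makes it far less sensitive to extreme values.

```python
import numpy as np
from scipy import stats

# Hypothetical expenses: nine ordinary values and one extreme value
expense = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

plain_mean = expense.mean()

# Cut 10% of the observations from each end (here: 1 and 1000), then average the rest
trimmed_mean = stats.trim_mean(expense, 0.1)

print(plain_mean)    # 104.5
print(trimmed_mean)  # 5.5
```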

You may notice that the Style group purchases more frequently online (`online_trans`), but its online expense (`online_exp`) is not higher. That makes us wonder what the average expense per transaction is, which would give a better idea of the group's price range.



## Create new variable

The analytical process is iterative rather than a set of independent steps. The current step sheds new light on what to do next, and sometimes you need to go back and fix something in a previous step. Let’s check the average expense per online and in-store transaction:


In [69]:
df2 = (SimDat
      .groupby('segment', as_index=False)
      .agg({'online_exp': lambda x: round(x.sum(), 0),
            'store_exp': lambda x: round(x.sum(), 0),
            'online_trans': lambda x: round(x.sum(),0),
            'store_trans': lambda x: round(x.sum(), 0)          
            })
      )

# create new columns: 
df2['avg_online_exp'] = df2.online_exp/df2.online_trans
df2['avg_store_exp'] = df2.store_exp/df2.store_trans
df2[['segment','avg_online_exp', 'avg_store_exp']]

|  | segment | avg_online_exp | avg_store_exp |
| --- | --- | --- | --- |
| 0 | Conspicuous | 442.274492 | 479.245404 |
| 1 | Price | 69.278003 | 81.303188 |
| 2 | Quality | 126.051972 | 105.115183 |
| 3 | Style | 92.833694 | 121.069549 |



The Price group has the lowest average expense per transaction, and the Conspicuous group has the highest.
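
New columns can also be created without modifying the data frame in place. A sketch using `assign` on made-up totals (the segment labels and numbers below are hypothetical, not taken from `SimDat`):

```python
import pandas as pd

# Hypothetical per-segment totals, for illustration only
totals = pd.DataFrame({
    "segment": ["A", "B"],
    "online_exp": [1000.0, 300.0],
    "online_trans": [10, 6],
})

# assign returns a new data frame with the extra column; totals itself is unchanged
with_avg = totals.assign(avg_online_exp=lambda d: d.online_exp / d.online_trans)
print(with_avg)
```

Dividing total expense by total transactions gives the average expense per transaction for the whole segment, which is generally not the same as averaging each customer's own ratio.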

Another common task is to check which columns have missing values. It requires the program to look at each column in the data.

In [75]:
SimDat.isna().any()

age             False
gender          False
income           True
house           False
store_exp       False
online_exp      False
store_trans     False
online_trans    False
Q1              False
Q2              False
Q3              False
Q4              False
Q5              False
Q6              False
Q7              False
Q8              False
Q9              False
Q10             False
segment         False
dtype: bool
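
`any()` only says whether a column has missing values. To see how many, the same boolean mask can be summed or averaged. A minimal sketch on a made-up data frame `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [23, 35, 47, 61],
    "income": [81000.0, np.nan, 140000.0, np.nan],
})

n_missing = toy.isna().sum()        # number of missing values per column
share_missing = toy.isna().mean()   # proportion of missing values per column
cols_with_na = toy.columns[toy.isna().any()].tolist()  # names of affected columns

print(n_missing)
print(share_missing)
print(cols_with_na)
```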

## Merge

We create two small data sets to show how `merge` works.



In [79]:
dfx = pd.DataFrame({'ID': ["A", "B", "C"],
                   'x1': [1, 2, 3]})
dfx

|  | ID | x1 |
| --- | --- | --- |
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | C | 3 |


In [80]:
dfy = pd.DataFrame({'ID': ["B", "C", "D"],
                   'y1': [True, True, False]})
dfy

|  | ID | y1 |
| --- | --- | --- |
| 0 | B | True |
| 1 | C | True |
| 2 | D | False |


In [83]:
# Join matching rows from dfy to dfx
pd.merge(dfx, dfy, how = 'left', on = 'ID')

|  | ID | x1 | y1 |
| --- | --- | --- | --- |
| 0 | A | 1 | NaN |
| 1 | B | 2 | True |
| 2 | C | 3 | True |


In [84]:
# Retain only rows in both sets
pd.merge(dfx, dfy, how = 'inner', on = 'ID')

|  | ID | x1 | y1 |
| --- | --- | --- | --- |
| 0 | B | 2 | True |
| 1 | C | 3 | True |


In [85]:
# Retain all values, all rows
pd.merge(dfx, dfy, how = 'outer', on = 'ID')

|  | ID | x1 | y1 |
| --- | --- | --- | --- |
| 0 | A | 1.0 | NaN |
| 1 | B | 2.0 | True |
| 2 | C | 3.0 | True |
| 3 | D | NaN | False |
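
To check where each row of a merge came from, `merge` accepts `indicator=True`. The sketch below reuses `dfx` and `dfy` from above and also shows a right join, which the cells above do not cover.

```python
import pandas as pd

dfx = pd.DataFrame({"ID": ["A", "B", "C"], "x1": [1, 2, 3]})
dfy = pd.DataFrame({"ID": ["B", "C", "D"], "y1": [True, True, False]})

# indicator=True adds a _merge column with the values left_only, right_only or both
merged = pd.merge(dfx, dfy, how="outer", on="ID", indicator=True)
print(merged)

# Keep all rows of dfy (the right data frame) instead of dfx
right = pd.merge(dfx, dfy, how="right", on="ID")
print(right)
```

In the outer join, `x1` is shown as `1.0`, `2.0`, `3.0` because the missing value for `D` is stored as `NaN`, which turns the integer column into a float column.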


## Reshape data

Take a small subset of our example clothes customer data to illustrate:

In [25]:
# iloc selects by position 
sdat = SimDat.iloc[:,0:6].sample(100)
sdat

|  | age | gender | income | house | store_exp | online_exp |
| --- | --- | --- | --- | --- | --- | --- |
| 737 | 23 | Female | 81763.916849 | No | 205.666156 | 1040.896688 |
| 684 | 23 | Female | 89609.949268 | No | 203.225898 | 1734.342716 |
| 364 | 47 | Male | 140225.820678 | No | 4387.143342 | 5211.133625 |
| 80 | 61 | Male | NaN | Yes | 358.173344 | 315.417781 |
| 612 | 35 | Female | 73300.171093 | No | 272.391989 | 2059.876395 |
| ... | ... | ... | ... | ... | ... | ... |
| 430 | 46 | Male | 217731.681147 | Yes | 3593.886783 | 4398.094199 |
| 825 | 26 | Male | 96070.678386 | Yes | 191.649454 | 1663.719362 |
| 562 | 45 | Female | 73857.691171 | No | 306.127635 | 2048.518957 |
| 522 | 42 | Female | 71264.151240 | No | 424.963019 | 1893.200051 |


For the above data `sdat`, what if we want to have a variable indicating the purchasing channel (i.e. online or in-store) and another column with the corresponding expense amount? Assume we want to keep the rest of the columns the same. This is a task of changing data from “wide” to “long”. There are two general ways to reshape data:

- Use `melt()` to convert an object into a molten data frame, i.e., from wide to long
- Use `pivot()` (or `pivot_table()` when values need to be aggregated) to cast a molten data frame into the shape you want, i.e., from long to wide

In [41]:
sdat_melt = pd.melt(sdat,
        id_vars = ['age', 'gender', 'income', 'house'],
        value_vars = ["store_exp", "online_exp"],
        var_name = 'Channel',
        value_name = 'Expense')

sdat_melt

|  | age | gender | income | house | Channel | Expense |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 23 | Female | 81763.916849 | No | store_exp | 205.666156 |
| 1 | 23 | Female | 89609.949268 | No | store_exp | 203.225898 |
| 2 | 47 | Male | 140225.820678 | No | store_exp | 4387.143342 |
| 3 | 61 | Male | NaN | Yes | store_exp | 358.173344 |
| 4 | 35 | Female | 73300.171093 | No | store_exp | 272.391989 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 46 | Male | 217731.681147 | Yes | online_exp | 4398.094199 |
| 196 | 26 | Male | 96070.678386 | Yes | online_exp | 1663.719362 |
| 197 | 45 | Female | 73857.691171 | No | online_exp | 2048.518957 |
| 198 | 42 | Female | 71264.151240 | No | online_exp | 1893.200051 |


You melted the data frame `sdat` by two variables: `store_exp` and `online_exp` (`value_vars = ["store_exp", "online_exp"]`). The new variable is named `Channel`, set by `var_name = 'Channel'`. The value column is named `Expense`, set by `value_name = 'Expense'`.
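
The mechanics are easier to follow on a tiny example. In the sketch below, the two-row data frame `wide` and its `id` column are made up for illustration. Each input row yields one output row per melted column, which is why melting two columns of the 100-row `sdat` gives 200 rows.

```python
import pandas as pd

# Hypothetical wide data: one row per customer, one column per channel
wide = pd.DataFrame({
    "id": [1, 2],
    "store_exp": [200.0, 350.0],
    "online_exp": [1000.0, 300.0],
})

# id_vars are repeated on every output row; value_vars are stacked into two columns
long = pd.melt(wide,
               id_vars=["id"],
               value_vars=["store_exp", "online_exp"],
               var_name="Channel",
               value_name="Expense")
print(long)
```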

Sometimes we want to convert the data from “long” to “wide”. For example, you may want to compare online and in-store expense between males and females by house ownership.

In [51]:
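sdat_pivot = (sdat_melt
              .pivot_table(values='Expense',
                           index=['gender', 'house'],
                           columns='Channel',
                           # the string 'mean' avoids the FutureWarning that
                           # recent pandas versions raise for aggfunc=np.mean
                           aggfunc='mean')
  )
sdat_pivot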

| gender | house | Channel = online_exp | Channel = store_exp |
| --- | --- | --- | --- |
| Female | No | 2017.491842 | 353.80076 |
| Female | Yes | 2378.192391 | 1166.969056 |
| Male | No | 2332.378214 | 741.731604 |
| Male | Yes | 1934.985999 | 1578.206157 |
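
The cell above uses `pivot_table` rather than `pivot` because several rows share the same (`gender`, `house`, `Channel`) combination and have to be averaged. A minimal sketch of the difference, on a made-up long data frame with a hypothetical `id` column:

```python
import pandas as pd

# Hypothetical long data: one row per customer and channel
long = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3],
    "gender": ["Female", "Male", "Female", "Female", "Male", "Female"],
    "Channel": ["store_exp"] * 3 + ["online_exp"] * 3,
    "Expense": [200.0, 350.0, 400.0, 1000.0, 300.0, 600.0],
})

# pivot only reshapes: each (id, Channel) pair must appear exactly once
wide = long.pivot(index="id", columns="Channel", values="Expense")
print(wide)

# pivot_table aggregates rows that share the same index/column combination
by_gender = long.pivot_table(values="Expense", index="gender",
                             columns="Channel", aggfunc="mean")
print(by_gender)
```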
