# Chapter I - Data Manipulation with Pandas
## Part I - Data Transformation with Pandas

### Question 1
1. Impute missing **Income** values with the mean income of the row's **Education** and **Marital_Status** group.
- dfIncomeMean is the dataframe that contains the mean income grouped by Education and Marital_Status
- Using the apply method on the entire dataframe (row by row -> axis = 1), we look up each row's group mean in dfIncomeMean; only rows with a null Income are overwritten
2. Create column **DaysSinceEnrollment**
- Convert Dt_Customer to a pandas 'datetime64' column
- Calculate the timedelta between the reference date (2017-01-01) and each enrollment date using the datetime library
- Access the 'days' attribute of the resulting 'timedelta64' values (int64)
3. Column **Generation**
- Use the function *generationClass* to classify each user based on Year_Birth (int)
- Using the apply method and a lambda function that calls *generationClass*, we assign a generation to each user
- Users born before 1928 are assigned *None*; only rows with a non-null Generation are kept.
4. Sort DataFrame by **ID**
- Use the *sort_values* method in descending order (ascending = False)
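The group-mean imputation in step 1 can also be written without a row-wise apply. The following is a minimal sketch of that alternative, not the notebook's own method: it uses `groupby(...).transform('mean')` together with `fillna`. The `toy` dataframe and its values are made up for illustration and do not come from customers.csv.

```python
import pandas as pd

# Hypothetical toy data (not taken from customers.csv)
toy = pd.DataFrame({
    "Education": ["PhD", "PhD", "PhD", "Master", "Master"],
    "Marital_Status": ["Single", "Single", "Single", "Married", "Married"],
    "Income": [50000.0, None, 70000.0, 40000.0, None],
})

# Mean income of each (Education, Marital_Status) group, broadcast back
# to one value per row; missing incomes are ignored when averaging
group_mean = toy.groupby(["Education", "Marital_Status"])["Income"].transform("mean")

# Fill only the missing incomes with the mean of the row's group
toy["Income"] = toy["Income"].fillna(group_mean)

print(toy)
```

A group in which every income is missing has no mean to fall back on, so its rows stay null under this sketch.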

In [1]:
import pandas as pd
from datetime import date

# Import dataset
df = pd.read_csv('customers.csv')

# Q1
dfIncomeMean = df[['Education','Marital_Status','Income']].groupby(by=['Education','Marital_Status']).mean()
# Look up each row's group mean; only rows with a missing Income are overwritten
df.loc[df['Income'].isnull(), 'Income'] = df.apply(lambda x: dfIncomeMean.loc[(x.Education, x.Marital_Status), 'Income'], axis=1)

# Q2
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])
# Timedelta between the reference date (2017-01-01) and each enrollment date
df['DaysSinceEnrollment'] = pd.Timestamp(date(2017,1,1)) - df['Dt_Customer']
df['DaysSinceEnrollment'] = df['DaysSinceEnrollment'].dt.days

# Q3
def generationClass(year):
    if year >= 1981: return 'Millenial'
    elif 1981 > year >= 1965: return 'GenX'
    elif 1965 > year >= 1946: return 'BabyBoomer'
    elif 1946 > year >= 1928: return 'Silent'
    else: return None

df['Generation'] = df['Year_Birth'].apply(lambda x: generationClass(x))
df = df.loc[df['Generation'].notnull(),:]

# Q4
df = df.sort_values(by='ID', ascending = False)
df.head()


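| Unnamed: 0 | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Complain | DaysSinceEnrollment | Generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 327 | 11191 | 1986 | Graduation | Divorced | 41411.0 | 0 | 0 | 2013-12-07 | 11 | 37 | ... | 3 | 18 | 1 | 2 | 1 | 4 | 6 | 0 | 1121 | Millenial |
| 980 | 11188 | 1957 | Graduation | Together | 26091.0 | 1 | 1 | 2014-02-25 | 84 | 15 | ... | 17 | 20 | 3 | 2 | 1 | 3 | 5 | 0 | 1041 | BabyBoomer |
| 2152 | 11187 | 1978 | Basic | Single | 26487.0 | 1 | 0 | 2013-05-20 | 23 | 2 | ... | 14 | 23 | 3 | 2 | 1 | 3 | 5 | 0 | 1322 | GenX |
| 84 | 11178 | 1972 | Master | Single | 42394.0 | 1 | 0 | 2014-03-23 | 69 | 15 | ... | 1 | 4 | 1 | 1 | 0 | 3 | 7 | 0 | 1015 | GenX |
| 798 | 11176 | 1970 | PhD | Together | 65968.0 | 0 | 1 | 2014-05-12 | 12 | 376 | ... | 4 | 4 | 2 | 5 | 4 | 7 | 3 | 0 | 965 | GenX |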
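As a sanity check, the two derived columns can be recomputed for the five rows shown in the output above. This is only a sketch: the `preview` dataframe and the `REFERENCE_DATE` name are introduced here for illustration, and `pd.cut` is used as a vectorized stand-in for *generationClass*, not as the notebook's method. The bin edges are chosen to match the year boundaries in *generationClass*.

```python
import pandas as pd

# The five rows displayed in the output above (values copied from it)
preview = pd.DataFrame({
    "ID": [11191, 11188, 11187, 11178, 11176],
    "Year_Birth": [1986, 1957, 1978, 1972, 1970],
    "Dt_Customer": pd.to_datetime(
        ["2013-12-07", "2014-02-25", "2013-05-20", "2014-03-23", "2014-05-12"]
    ),
})

# Reference date used in the Q2 code
REFERENCE_DATE = pd.Timestamp(2017, 1, 1)
preview["DaysSinceEnrollment"] = (REFERENCE_DATE - preview["Dt_Customer"]).dt.days

# Right-closed bins: (1927, 1945], (1945, 1964], (1964, 1980], (1980, inf)
bins = [1927, 1945, 1964, 1980, float("inf")]
labels = ["Silent", "BabyBoomer", "GenX", "Millenial"]
preview["Generation"] = pd.cut(preview["Year_Birth"], bins=bins, labels=labels)

print(preview[["ID", "DaysSinceEnrollment", "Generation"]])
```

Birth years before 1928 fall outside the first bin and come out as NaN, which corresponds to the *None* branch of *generationClass*.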


### Save DataFrame to CSV

In [2]:
df.to_csv("Group4_Assignment2_C1_Part1.csv", sep=',', index=False)
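When the saved file is read back, date columns arrive as plain strings unless they are parsed explicitly. The sketch below illustrates this under stated assumptions: the `sample` dataframe is a two-row stand-in built from values in the output above, and an in-memory `io.StringIO` buffer replaces the real file so that nothing is written to disk.

```python
import io
import pandas as pd

# Two-row stand-in for the final dataframe (values taken from the output above)
sample = pd.DataFrame({
    "ID": [11191, 11188],
    "Dt_Customer": pd.to_datetime(["2013-12-07", "2014-02-25"]),
    "DaysSinceEnrollment": [1121, 1041],
})

# Write to an in-memory buffer with the same options as the cell above
buffer = io.StringIO()
sample.to_csv(buffer, sep=",", index=False)
buffer.seek(0)

# parse_dates restores Dt_Customer as a datetime column on reload
restored = pd.read_csv(buffer, parse_dates=["Dt_Customer"])

print(restored.dtypes)
```

Because `index=False` leaves the row index out of the file, reloading it does not produce an extra unnamed index column.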