# Coding etiquette and Excel reporting - Part 1

## Content list:

##### 01. Importing libraries
##### 02. Importing data
##### 03. PII screening
##### 04. Customer information by geographical region: 
###### a) Create lists using Wikipedia's list of regions (i.e., Northeast, Midwest, South, West)
###### b) Compare regions based on spending habits
##### 05. Flag for low-activity customers:
###### a) Create a low activity flag 
###### b) Exclude low-activity customers from analysis 
###### c) Create a dataframe with only the normal activity customers 
###### d) Export dataframe with new columns

## 01. Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

## 02. Importing data

In [2]:
path = r'C:\Users\isobr\OneDrive\02122022Instacart Basket Analysis'

In [3]:
path

'C:\\Users\\isobr\\OneDrive\\02122022Instacart Basket Analysis'

In [4]:
cust_prods_ords = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'cust_prods_ords.pkl'))

In [5]:
cust_prods_ords.shape

(32404859, 34)

In [6]:
cust_prods_ords.head(2)

Unnamed: 0,Unnamed: 0_x,product_id,product_name,aisle_id,department_id,prices,Unnamed: 0.1,Unnamed: 0_y,order_id,user_id,...,median_frequency,frequency_flag,surname,Gender,State,Age,date_joined,number_dependants,family_status,income
0,0,1,Chocolate Sandwich Cookies,61,19,5.8,1987,1987,3139998,138,...,8.0,Frequent customer,Cox,Male,Minnesota,81,2019-08-01,1,married,49620
1,0,1,Chocolate Sandwich Cookies,61,19,5.8,1989,1989,1977647,138,...,8.0,Frequent customer,Cox,Male,Minnesota,81,2019-08-01,1,married,49620


## 03. PII screening

In [7]:
cust_prods_ords.columns

Index(['Unnamed: 0_x', 'product_id', 'product_name', 'aisle_id',
       'department_id', 'prices', 'Unnamed: 0.1', 'Unnamed: 0_y', 'order_id',
       'user_id', 'order_number', 'orders_day_of_week', 'order_hour_of_day',
       'days_since_prior_order', 'add_to_cart_order', 'reordered',
       'price_range_loc', 'busiest_day', 'busier_days',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'average_price',
       'spender_flag', 'median_frequency', 'frequency_flag', 'surname',
       'Gender', 'State', 'Age', 'date_joined', 'number_dependants',
       'family_status', 'income'],
      dtype='object')

The 'surname' column is personally identifiable information (PII). The name column had already been deleted because it contained many missing values. Normally the 'surname' column should be dropped from the analysis as well, so that clients cannot be identified.

However, since the data used in this exercise was fabricated for the course work, PII laws do not apply, and the 'surname' column can therefore be kept.
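
If the data were real, the screening step would typically end by dropping the identifying column. The following is a minimal sketch on a small made-up frame; the `pii_columns` list and the toy values are assumptions for illustration and are not part of the project data.

```python
import pandas as pd

# Toy frame standing in for cust_prods_ords (values are made up)
df = pd.DataFrame({
    'user_id': [138, 764],
    'surname': ['Cox', 'Myers'],
    'State': ['Minnesota', 'Wisconsin'],
})

# Hypothetical list of columns judged to be PII
pii_columns = ['surname']

# errors='ignore' skips any listed column that is already absent
df_no_pii = df.drop(columns=pii_columns, errors='ignore')

print(df_no_pii.columns.tolist())
```

`drop` returns a new dataframe here, so the original frame keeps its 'surname' column until the result is assigned back.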

## 04. Customer information by geographical region

In [None]:
# Check frequency for States
cust_prods_ords['State'].value_counts(dropna = False)

#### a) Create lists using Wikipedia's list of regions (i.e., Northeast, Midwest, South, West)

In [9]:
#List for Northeast
northeast = ['Maine','New Hampshire','Vermont','Massachusetts','Rhode Island','Connecticut','New York','Pennsylvania','New Jersey']

In [10]:
#list for midwest states
midwest = ['Wisconsin','Michigan','Illinois','Indiana','Ohio','North Dakota','South Dakota','Nebraska','Kansas','Minnesota','Iowa','Missouri']

In [11]:
#list for south states
south = ['Delaware','Maryland','District of Columbia','Virginia','West Virginia','North Carolina','South Carolina','Georgia','Florida','Kentucky','Tennessee','Mississippi','Alabama','Oklahoma','Texas','Arkansas','Louisiana']

In [12]:
#List for west states
west = ['Idaho','Montana','Wyoming','Nevada','Utah','Colorado','Arizona','New Mexico','Alaska','Washington','Oregon','California','Hawaii']

In [13]:
# Build the region label for each row with a for loop; any state not found in the first three lists falls through to "West"
result = []

for value in cust_prods_ords["State"]:
  if value in northeast:
    result.append("Northeast")
  elif value in midwest:
    result.append("Midwest")
  elif value in south:
    result.append("South")
  else:
    result.append("West")
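
The same labelling can also be done without a row-by-row loop by turning the region lists into a lookup dictionary and calling `Series.map`. The sketch below uses shortened lists and a toy `State` column, both assumptions for illustration. Unlike the `else` branch above, a state that appears in no list becomes `NaN` instead of being counted as "West".

```python
import pandas as pd

# Shortened region lists for illustration; the full lists above would be used in practice
northeast = ['Maine', 'New York']
midwest = ['Minnesota', 'Wisconsin']
south = ['Texas', 'Florida']
west = ['California', 'Oregon']

# Build a state -> region lookup from the four lists
state_to_region = {}
for region_name, state_list in [('Northeast', northeast), ('Midwest', midwest),
                                ('South', south), ('West', west)]:
    for state in state_list:
        state_to_region[state] = region_name

# Toy State column, including one value that matches no list
states = pd.Series(['Minnesota', 'Texas', 'Oregon', 'Maine', 'Puerto Rico'])

# map() looks each state up in the dictionary; unmatched values become NaN
region = states.map(state_to_region)

print(region.value_counts(dropna=False))
```

On a frame with tens of millions of rows, the dictionary lookup avoids scanning a Python list for every row, and the `NaN` count makes misspelled or missing states visible in the frequency check.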

In [14]:
#checking the first few entries of the new list - i.e. result
result[:5]

['Midwest', 'Midwest', 'Midwest', 'Midwest', 'Midwest']

In [15]:
#create a new column with the 'result'
cust_prods_ords['region'] = result

In [16]:
#check frequency
cust_prods_ords['region'].value_counts(dropna=False)

South        10791885
West          8292913
Midwest       7597325
Northeast     5722736
Name: region, dtype: int64

#### b) Compare regions based on spending habits

In [17]:
# comparing regions in terms of customer spending
crosstab_revenue = pd.crosstab(cust_prods_ords['region'], cust_prods_ords['spender_flag'], dropna = False)

In [18]:
crosstab_revenue

region,High spender,Low spender
Midwest,7441393,155932
Northeast,5614556,108180
South,10582404,209481
West,8132642,160271
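
Because the regions differ in size, raw counts are hard to compare directly. One option is to express each row as shares with the `normalize='index'` argument of `pd.crosstab`. The sketch below runs on a small made-up frame (the `toy` data is an assumption, not the project data).

```python
import pandas as pd

# Toy data standing in for the region and spender_flag columns
toy = pd.DataFrame({
    'region': ['South', 'South', 'South', 'West', 'West', 'Midwest'],
    'spender_flag': ['High spender', 'High spender', 'Low spender',
                     'High spender', 'Low spender', 'High spender'],
})

# normalize='index' divides each row by its row total, so every row sums to 1
share = pd.crosstab(toy['region'], toy['spender_flag'], normalize='index')

print(share.round(2))
```

Each row then reads as the proportion of that region's records falling under each spender flag.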


In [19]:
#copying crosstab table
crosstab_revenue.to_clipboard()

## 05. Flag for low-activity customers

#### a) Create a low activity flag

In [20]:
# Use loc to create the new flag: rows with max_order >= 5 are 'Normal activity'
cust_prods_ords.loc[cust_prods_ords['max_order']>=5,'activity_flag']='Normal activity'

In [21]:
cust_prods_ords.loc[cust_prods_ords['max_order']<5,'activity_flag']='Low activity'

In [22]:
cust_prods_ords['activity_flag'].value_counts(dropna=False)

Normal activity    30964564
Low activity        1440295
Name: activity_flag, dtype: int64

In [23]:
#making sure rows are matching
30964564+1440295

32404859
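
As an alternative sketch, the flag can be assigned in one step with `np.where`, and the row-count check can be written as an assertion instead of a manual sum. The `toy` frame is an assumption for illustration. Note one behavioural difference: with `np.where`, a missing `max_order` would be labelled 'Low activity', whereas the two `loc` statements above would leave it as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy max_order values standing in for the real column
toy = pd.DataFrame({'max_order': [3, 5, 12, 4, 30]})

# One-step flag: 5 or more orders is normal activity, anything else is low activity
toy['activity_flag'] = np.where(toy['max_order'] >= 5,
                                'Normal activity', 'Low activity')

# The flag counts should add up to the number of rows
counts = toy['activity_flag'].value_counts(dropna=False)
assert counts.sum() == len(toy)

print(counts)
```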

#### b) Exclude low-activity customers from analysis

In [24]:
# First, to separate the two groups, create two dataframes: one with the low-activity customers and one with the normal-activity customers
cust_prods_ords_low = cust_prods_ords[cust_prods_ords['activity_flag']=='Low activity']

In [25]:
#checking if it worked
cust_prods_ords_low.head(3)

Unnamed: 0,Unnamed: 0_x,product_id,product_name,aisle_id,department_id,prices,Unnamed: 0.1,Unnamed: 0_y,order_id,user_id,...,surname,Gender,State,Age,date_joined,number_dependants,family_status,income,region,activity_flag
340,0,1,Chocolate Sandwich Cookies,61,19,5.8,12198,12198,652770,764,...,Myers,Female,Wisconsin,40,2020-02-09,3,married,31308,Midwest,Low activity
341,0,1,Chocolate Sandwich Cookies,61,19,5.8,12200,12200,1813452,764,...,Myers,Female,Wisconsin,40,2020-02-09,3,married,31308,Midwest,Low activity
342,3260,3260,Chips Ahoy!/Nutter Butter/Oreo Cookies,61,19,1.7,12198,12198,652770,764,...,Myers,Female,Wisconsin,40,2020-02-09,3,married,31308,Midwest,Low activity


In [26]:
#Exporting this dataframe
cust_prods_ords_low.to_pickle(os.path.join(path, '02 Data','Prepared data', 'low_activity_cust.pkl'))

#### c) Create a dataframe with only the normal activity customers

In [27]:
cust_prods_ords_normal = cust_prods_ords[cust_prods_ords['activity_flag']=='Normal activity']

In [28]:
#checking if it worked
cust_prods_ords_normal.head(3)

Unnamed: 0,Unnamed: 0_x,product_id,product_name,aisle_id,department_id,prices,Unnamed: 0.1,Unnamed: 0_y,order_id,user_id,...,surname,Gender,State,Age,date_joined,number_dependants,family_status,income,region,activity_flag
0,0,1,Chocolate Sandwich Cookies,61,19,5.8,1987,1987,3139998,138,...,Cox,Male,Minnesota,81,2019-08-01,1,married,49620,Midwest,Normal activity
1,0,1,Chocolate Sandwich Cookies,61,19,5.8,1989,1989,1977647,138,...,Cox,Male,Minnesota,81,2019-08-01,1,married,49620,Midwest,Normal activity
2,907,907,Premium Sliced Bacon,106,12,20.0,1960,1960,3160996,138,...,Cox,Male,Minnesota,81,2019-08-01,1,married,49620,Midwest,Normal activity


#### d) Export dataframe with new columns

In [29]:
#exporting dataframe while excluding low-activity customers
cust_prods_ords_normal.to_pickle(os.path.join(path, '02 Data','Prepared data', 'normal_activity_cust.pkl'))
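
A quick way to confirm an export like this is to read the pickle back and compare it with the dataframe in memory. The sketch below does this with a made-up frame and a temporary directory (both assumptions), so it does not touch the project folders.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for cust_prods_ords
toy = pd.DataFrame({
    'user_id': [138, 764, 901],
    'activity_flag': ['Normal activity', 'Low activity', 'Normal activity'],
})

# Keep only the normal-activity rows, as in step c)
normal = toy[toy['activity_flag'] == 'Normal activity']

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'normal_activity_cust.pkl')
    normal.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() compares values, dtypes and index
print(reloaded.equals(normal))
```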