# Data Wrangling

*FYI*: Harvard defines data wrangling (also called data cleaning, data remediation, or data munging) as a variety of processes designed to transform raw data into more readily used formats. It is the process of transforming and mapping data from one "raw" form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.

In this notebook, I'll be doing some data cleaning and feature engineering.

One main objective is to create new columns from the ones we already have, so that the data is easier to filter for insight.
 - One way to do this is to split the "phone model" column into brand, model, and generation.
 - Another way is to derive from the datetime variable "weeks_monday" whether the week is the 1st, 2nd, 3rd, 4th, or 5th Monday of the month. This could come in handy for trend analysis.

For cleaning, two tasks must be done:
 - Some data points don't contain information on the gb specification, and they need to be deleted.
 - The "weeks_monday" variable needs to be changed to just a date (no time needed).
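The two cleaning tasks above could be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual cleaning step. It assumes that a missing gb specification shows up as a "phone size" value that is empty or doesn't look like digits followed by "gb"; the notebook doesn't show what those rows really contain, so the `sample` frame and the pattern are hypothetical.

```python
import pandas as pd

# Hypothetical rows: the "unknown" and missing sizes are assumptions for illustration
sample = pd.DataFrame({
    "phone model": ["alcatel axel", "alcatel axel", "apple iphone 11"],
    "phone size": ["32gb", "unknown", None],
    "weeks_monday": ["2022-01-24 00:00:00", "2022-11-14 00:00:00", "2022-10-17 00:00:00"],
})

# Keep only rows whose size looks like "<digits>gb"; missing values count as no match
has_gb = sample["phone size"].str.match(r"^\d+gb$", na=False)
cleaned = sample[has_gb].copy()

# Parse the timestamps and reset the time part to midnight (dtype stays datetime64)
cleaned["weeks_monday"] = pd.to_datetime(cleaned["weeks_monday"]).dt.normalize()

print(cleaned)
```

Using `.dt.normalize()` keeps the column as a datetime, so `.dt.year` and `.dt.month` still work afterwards. Converting with `.dt.date` instead would produce plain Python `date` objects in an object column.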


## Loading in data and packages

In [1]:
#load in needed packages
import pandas as pd
import numpy as np

In [2]:
%cd ../../../../data/p_dsi/teams2023/asurion_data/

/gpfs52/data/p_dsi/teams2023/asurion_data


In [3]:
#read in data from xlsx file
df = pd.read_excel('Asurion_data.xlsx')
df

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday
0,alcatel axel,32gb,black,1,2022-01-24
1,alcatel axel,32gb,black,3,2022-11-14
2,alcatel axel,32gb,black,1,2022-11-21
3,alcatel axel,32gb,black,3,2022-11-28
4,alcatel axel,32gb,black,1,2022-12-12
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15


## Data Cleaning

In [4]:
#convert "weeks_monday" to datetime, then pull out the year and month
df['weeks_monday'] = pd.to_datetime(df['weeks_monday'], format='%Y-%m-%d')
df['Year'] = df['weeks_monday'].dt.year
df['Month'] = df['weeks_monday'].dt.month

In [5]:
df.dtypes

phone model             object
phone size              object
phone color             object
claim                    int64
weeks_monday    datetime64[ns]
Year                     int64
Month                    int64
dtype: object

## Examining the Data

In [6]:
#give information on the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26661 entries, 0 to 26660
Data columns (total 7 columns):
phone model     26661 non-null object
phone size      26661 non-null object
phone color     26661 non-null object
claim           26661 non-null int64
weeks_monday    26661 non-null datetime64[ns]
Year            26661 non-null int64
Month           26661 non-null int64
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 1.4+ MB


In [7]:
#display the first 20 rows of the data
df.head(20)

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month
0,alcatel axel,32gb,black,1,2022-01-24,2022,1
1,alcatel axel,32gb,black,3,2022-11-14,2022,11
2,alcatel axel,32gb,black,1,2022-11-21,2022,11
3,alcatel axel,32gb,black,3,2022-11-28,2022,11
4,alcatel axel,32gb,black,1,2022-12-12,2022,12
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04,2022,7
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25,2022,7
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01,2022,8
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08,2022,8
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15,2022,8


In [8]:
#display the last 20 rows of the data
df.tail(20)

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month
26641,samsung galaxy z fold3 5g,256gb,silver,4,2022-10-17,2022,10
26642,samsung galaxy z fold3 5g,256gb,silver,26,2022-10-24,2022,10
26643,samsung galaxy z fold3 5g,256gb,silver,18,2022-10-31,2022,10
26644,samsung galaxy z fold3 5g,256gb,silver,12,2022-11-07,2022,11
26645,samsung galaxy z fold3 5g,256gb,silver,17,2022-11-14,2022,11
26646,samsung galaxy z fold3 5g,256gb,silver,14,2022-11-21,2022,11
26647,samsung galaxy z fold3 5g,256gb,silver,11,2022-11-28,2022,11
26648,samsung galaxy z fold3 5g,256gb,silver,11,2022-12-05,2022,12
26649,samsung galaxy z fold3 5g,256gb,silver,1,2022-12-12,2022,12
26650,samsung galaxy z fold3 5g,256gb,silver,10,2022-12-19,2022,12


In [9]:
#display the number of rows and columns in the data
df.shape

(26661, 7)

In [10]:
#display the types of data in each column
df.dtypes

phone model             object
phone size              object
phone color             object
claim                    int64
weeks_monday    datetime64[ns]
Year                     int64
Month                    int64
dtype: object

In [11]:
#display the number of unique values in each column
df.nunique()

phone model     128
phone size        9
phone color      24
claim           431
weeks_monday     86
Year              3
Month            12
dtype: int64

Let's look at just the weeks_monday variable.

In [12]:
df['weeks_monday'].min()

Timestamp('2021-06-28 00:00:00')

In [13]:
df['weeks_monday'].max()

Timestamp('2023-02-13 00:00:00')

We can see that the dataset spans from June 2021 to February 2023.
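As a quick sanity check on that span, the sketch below counts every Monday between the two endpoints. The dates come from the min/max outputs above; the variable names are just for illustration. The count it prints can be compared with the 86 unique "weeks_monday" values reported by `df.nunique()`.

```python
import pandas as pd

# Endpoints taken from the min/max outputs above
start = pd.Timestamp("2021-06-28")
end = pd.Timestamp("2023-02-13")

# Every Monday between the two endpoints, inclusive
mondays = pd.date_range(start, end, freq="W-MON")

print(len(mondays))  # 86
```

If every value in "weeks_monday" really is a Monday, a matching count suggests that no week is missing from the dataset as a whole. Individual phones can still have gaps: the alcatel axel rows, for example, jump from 2022-01-24 to 2022-11-14.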

## Creating new columns

### New columns from "phone model"

#### The goal of this section is to create 3 new columns: brand, model, and generation. This will assist us in filtering the phone models in our analysis. 

In [14]:
#create a new column in the data called "brand"
df["brand"] = df["phone model"].str.split(" ", n = 1, expand = True)[0]

df

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand
0,alcatel axel,32gb,black,1,2022-01-24,2022,1,alcatel
1,alcatel axel,32gb,black,3,2022-11-14,2022,11,alcatel
2,alcatel axel,32gb,black,1,2022-11-21,2022,11,alcatel
3,alcatel axel,32gb,black,3,2022-11-28,2022,11,alcatel
4,alcatel axel,32gb,black,1,2022-12-12,2022,12,alcatel
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04,2022,7,alcatel
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25,2022,7,alcatel
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01,2022,8,alcatel
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08,2022,8,alcatel
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15,2022,8,alcatel


In [15]:
#create a new column in the data called "model"
df["model"] = df["phone model"].str.split(" ", n = 1, expand = True)[1]

df

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand,model
0,alcatel axel,32gb,black,1,2022-01-24,2022,1,alcatel,axel
1,alcatel axel,32gb,black,3,2022-11-14,2022,11,alcatel,axel
2,alcatel axel,32gb,black,1,2022-11-21,2022,11,alcatel,axel
3,alcatel axel,32gb,black,3,2022-11-28,2022,11,alcatel,axel
4,alcatel axel,32gb,black,1,2022-12-12,2022,12,alcatel,axel
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04,2022,7,alcatel,smartflip 4052r
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25,2022,7,alcatel,smartflip 4052r
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01,2022,8,alcatel,smartflip 4052r
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08,2022,8,alcatel,smartflip 4052r
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15,2022,8,alcatel,smartflip 4052r


In [16]:
#create a new column in the data called "generation"
df["generation"] = df["model"].str.split(" ", n = 1, expand = True)[1]

df

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand,model,generation
0,alcatel axel,32gb,black,1,2022-01-24,2022,1,alcatel,axel,
1,alcatel axel,32gb,black,3,2022-11-14,2022,11,alcatel,axel,
2,alcatel axel,32gb,black,1,2022-11-21,2022,11,alcatel,axel,
3,alcatel axel,32gb,black,3,2022-11-28,2022,11,alcatel,axel,
4,alcatel axel,32gb,black,1,2022-12-12,2022,12,alcatel,axel,
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04,2022,7,alcatel,smartflip 4052r,4052r
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25,2022,7,alcatel,smartflip 4052r,4052r
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01,2022,8,alcatel,smartflip 4052r,4052r
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08,2022,8,alcatel,smartflip 4052r,4052r
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15,2022,8,alcatel,smartflip 4052r,4052r


We need to trim the model so it doesn't have the generation.

In [17]:
#in the column "model", delete the generation information
df["model"] = df["model"].str.split(" ", n = 1, expand = True)[0]

df

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand,model,generation
0,alcatel axel,32gb,black,1,2022-01-24,2022,1,alcatel,axel,
1,alcatel axel,32gb,black,3,2022-11-14,2022,11,alcatel,axel,
2,alcatel axel,32gb,black,1,2022-11-21,2022,11,alcatel,axel,
3,alcatel axel,32gb,black,3,2022-11-28,2022,11,alcatel,axel,
4,alcatel axel,32gb,black,1,2022-12-12,2022,12,alcatel,axel,
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04,2022,7,alcatel,smartflip,4052r
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25,2022,7,alcatel,smartflip,4052r
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01,2022,8,alcatel,smartflip,4052r
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08,2022,8,alcatel,smartflip,4052r
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15,2022,8,alcatel,smartflip,4052r


We need to trim the generation so it keeps only its first word; anything after it is dropped.

In [18]:
#in the column "generation", keep only the first word
df["generation"] = df["generation"].str.split(" ", n = 1, expand = True)[0]

df

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand,model,generation
0,alcatel axel,32gb,black,1,2022-01-24,2022,1,alcatel,axel,
1,alcatel axel,32gb,black,3,2022-11-14,2022,11,alcatel,axel,
2,alcatel axel,32gb,black,1,2022-11-21,2022,11,alcatel,axel,
3,alcatel axel,32gb,black,3,2022-11-28,2022,11,alcatel,axel,
4,alcatel axel,32gb,black,1,2022-12-12,2022,12,alcatel,axel,
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04,2022,7,alcatel,smartflip,4052r
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25,2022,7,alcatel,smartflip,4052r
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01,2022,8,alcatel,smartflip,4052r
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08,2022,8,alcatel,smartflip,4052r
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15,2022,8,alcatel,smartflip,4052r


Let's look at the dataframe to see how this worked.

In [19]:
#look at the middle of the data
df[100:200]


Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand,model,generation
100,apple iphone 11,128gb,black,220,2022-10-17,2022,10,apple,iphone,11
101,apple iphone 11,128gb,black,222,2022-10-24,2022,10,apple,iphone,11
102,apple iphone 11,128gb,black,244,2022-10-31,2022,10,apple,iphone,11
103,apple iphone 11,128gb,black,228,2022-11-07,2022,11,apple,iphone,11
104,apple iphone 11,128gb,black,247,2022-11-14,2022,11,apple,iphone,11
105,apple iphone 11,128gb,black,193,2022-11-21,2022,11,apple,iphone,11
106,apple iphone 11,128gb,black,249,2022-11-28,2022,11,apple,iphone,11
107,apple iphone 11,128gb,black,209,2022-12-05,2022,12,apple,iphone,11
108,apple iphone 11,128gb,black,233,2022-12-12,2022,12,apple,iphone,11
109,apple iphone 11,128gb,black,218,2022-12-19,2022,12,apple,iphone,11


Yay! It worked.
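The net effect of the cells above is that brand, model, and generation are the first, second, and third space-separated words of "phone model". A compact sketch of the same idea, using names that appear in this notebook's outputs (the `names`, `tokens`, and `parts` variables are only for illustration):

```python
import pandas as pd

# Example names taken from the outputs in this notebook
names = pd.Series([
    "alcatel axel",
    "alcatel smartflip 4052r",
    "apple iphone 11",
    "samsung galaxy z fold3 5g",
])

# Split on every space; shorter names are padded with missing values
tokens = names.str.split(" ", expand=True)

# First three words become brand, model, and generation
parts = pd.DataFrame({
    "brand": tokens[0],
    "model": tokens[1],
    "generation": tokens[2],
})

print(parts)
```

One caveat with splitting by position: a name like "samsung galaxy z fold3 5g" ends up with model "galaxy" and generation "z", so multi-word model names may need separate handling before these columns are used for filtering.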

### Creating new columns from "weeks_monday"

#### First, let's create a variable called "week_of_month" which will indicate which Monday of the month the date is. This could be helpful in our predictions.

In [20]:
#create a new column in the data called "week_of_month" from the "weeks_monday" column, which says which Monday of the month each date is
df["week_of_month"] = (df["weeks_monday"].dt.day - 1) // 7 + 1

df

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand,model,generation,week_of_month
0,alcatel axel,32gb,black,1,2022-01-24,2022,1,alcatel,axel,,4
1,alcatel axel,32gb,black,3,2022-11-14,2022,11,alcatel,axel,,2
2,alcatel axel,32gb,black,1,2022-11-21,2022,11,alcatel,axel,,3
3,alcatel axel,32gb,black,3,2022-11-28,2022,11,alcatel,axel,,4
4,alcatel axel,32gb,black,1,2022-12-12,2022,12,alcatel,axel,,2
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04,2022,7,alcatel,smartflip,4052r,1
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25,2022,7,alcatel,smartflip,4052r,4
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01,2022,8,alcatel,smartflip,4052r,1
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08,2022,8,alcatel,smartflip,4052r,2
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15,2022,8,alcatel,smartflip,4052r,3


Okay, let's look at the data to see if this worked.

In [21]:
#look at the middle of the data
df[150:210]


Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand,model,generation,week_of_month
150,apple iphone 11,128gb,green,42,2022-02-07,2022,2,apple,iphone,11,1
151,apple iphone 11,128gb,green,52,2022-02-14,2022,2,apple,iphone,11,2
152,apple iphone 11,128gb,green,36,2022-02-21,2022,2,apple,iphone,11,3
153,apple iphone 11,128gb,green,47,2022-02-28,2022,2,apple,iphone,11,4
154,apple iphone 11,128gb,green,53,2022-03-07,2022,3,apple,iphone,11,1
155,apple iphone 11,128gb,green,52,2022-03-14,2022,3,apple,iphone,11,2
156,apple iphone 11,128gb,green,56,2022-03-21,2022,3,apple,iphone,11,3
157,apple iphone 11,128gb,green,53,2022-03-28,2022,3,apple,iphone,11,4
158,apple iphone 11,128gb,green,55,2022-04-04,2022,4,apple,iphone,11,1
159,apple iphone 11,128gb,green,40,2022-04-11,2022,4,apple,iphone,11,2
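The rows shown above only reach week 4, but the formula can also return 5 when a Monday falls on the 29th, 30th, or 31st. A small sketch on a few 2022 Mondays (2022-10-31 appears in the tail of the data shown earlier; the `dates` series itself is just for illustration):

```python
import pandas as pd

# Mondays from 2022; the last two fall on the 29th and the 31st
dates = pd.Series(pd.to_datetime([
    "2022-08-01", "2022-08-08", "2022-08-15", "2022-08-22", "2022-08-29",
    "2022-10-31",
]))

# Day-of-month formula from the cell above, applied to the whole column at once
week_of_month = (dates.dt.day - 1) // 7 + 1

print(week_of_month.tolist())  # [1, 2, 3, 4, 5, 5]
```

Any filtering or modelling that relies on "week_of_month" should therefore allow for the values 1 through 5.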


#### Next, let's create "month" and "year" variables from "weeks_monday". These hold the same values as the "Month" and "Year" columns created in the Data Cleaning section.


In [22]:
#create a new column in the data called "month" from the "weeks_monday" column
df["month"] = df["weeks_monday"].dt.month


In [23]:
#create a new column in the data called "year" from the "weeks_monday" column
df["year"] = df["weeks_monday"].dt.year

In [24]:
df

Unnamed: 0,phone model,phone size,phone color,claim,weeks_monday,Year,Month,brand,model,generation,week_of_month,month,year
0,alcatel axel,32gb,black,1,2022-01-24,2022,1,alcatel,axel,,4,1,2022
1,alcatel axel,32gb,black,3,2022-11-14,2022,11,alcatel,axel,,2,11,2022
2,alcatel axel,32gb,black,1,2022-11-21,2022,11,alcatel,axel,,3,11,2022
3,alcatel axel,32gb,black,3,2022-11-28,2022,11,alcatel,axel,,4,11,2022
4,alcatel axel,32gb,black,1,2022-12-12,2022,12,alcatel,axel,,2,12,2022
5,alcatel smartflip 4052r,4gb,black,1,2022-07-04,2022,7,alcatel,smartflip,4052r,1,7,2022
6,alcatel smartflip 4052r,4gb,black,1,2022-07-25,2022,7,alcatel,smartflip,4052r,4,7,2022
7,alcatel smartflip 4052r,4gb,black,3,2022-08-01,2022,8,alcatel,smartflip,4052r,1,8,2022
8,alcatel smartflip 4052r,4gb,black,1,2022-08-08,2022,8,alcatel,smartflip,4052r,2,8,2022
9,alcatel smartflip 4052r,4gb,black,1,2022-08-15,2022,8,alcatel,smartflip,4052r,3,8,2022


## Exporting Data

In [25]:
#export the data to an xlsx file
df.to_excel(r"C:\Users\Rachel Montgomery\Documents\semester2\teams\asurion\wrangled_data.xlsx", index = False)

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\Rachel Montgomery\\Documents\\semester2\\teams\\asurion\\wrangled_data.xlsx'

In [None]:
%cd ../../../../data/p_dsi/teams2023/team1

In [None]:
#export the data to an xlsx file
df.to_excel('wrangled_data.xlsx', index = False)