# Data Wrangling - Data Surgery

<b>Data wrangling</b>, sometimes called data munging, is the process of transforming and mapping data from a "raw" form into another format to make it more appropriate and valuable for downstream purposes such as analytics. The goal of data wrangling is to ensure that the data is of good quality and useful. Data analysts typically spend more of their time wrangling data than actually analyzing it.

Data wrangling may be followed by further munging, data visualization, data aggregation, training a statistical model, and many other uses. It typically follows a set of general steps: extracting the data in raw form from the data source, "munging" the raw data (e.g. sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.
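The three steps (extract, munge, deposit) can be sketched on a toy example. This is only an illustration under stated assumptions: the data, the column names `name` and `minutes`, and the in-memory buffers standing in for a real source and sink are all hypothetical.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for a real source file.
raw_csv = io.StringIO(
    "name,minutes\n"
    "carol,30.5\n"
    "alice,12.0\n"
    "bob,\n"
)

# 1. Extract: read the raw data from the source.
raw = pd.read_csv(raw_csv)

# 2. Munge: drop rows with a missing value, then sort by name.
clean = raw.dropna(subset=["minutes"]).sort_values("name").reset_index(drop=True)

# 3. Deposit: write the result to a sink (an in-memory buffer here).
sink = io.StringIO()
clean.to_csv(sink, index=False)
```

In a real project the source and sink would more likely be files or database tables; the buffers just keep the sketch self-contained.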

In [22]:
import pandas as pd

In [23]:
data = pd.read_csv("../../datasets/customer-churn-model/Customer Churn Model.txt")

In [24]:
data.head(5)

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


### Create a data subset

In [25]:
account_length = data["Account Length"]

In [26]:
account_length.head(5)

0    128
1    107
2    137
3     84
4     75
Name: Account Length, dtype: int64

In [27]:
type(account_length)

pandas.core.series.Series

In [28]:
subset = data[["Account Length", "Phone", "Eve Charge", "Day Calls"]]

In [29]:
subset.head(5)

Unnamed: 0,Account Length,Phone,Eve Charge,Day Calls
0,128,382-4657,16.78,110
1,107,371-7191,16.62,123
2,137,358-1921,10.3,114
3,84,375-9999,5.26,71
4,75,330-6626,12.61,113


In [30]:
type(subset)

pandas.core.frame.DataFrame
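The two `type(...)` results above come from the kind of key passed to the brackets: a single string returns a `Series`, while a list of names returns a `DataFrame`, even when the list holds one name. A minimal sketch on a toy frame (the two rows are copied from the output above, the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Account Length": [128, 107],
    "Phone": ["382-4657", "371-7191"],
})

one_column = df["Account Length"]          # string key -> Series
one_column_frame = df[["Account Length"]]  # list key -> DataFrame with one column

print(type(one_column).__name__, type(one_column_frame).__name__)
```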

In [31]:
desired_columns = ["Account Length", "Phone", "Eve Charge", "Night Calls"]
subset = data[desired_columns]
subset.head(5)

Unnamed: 0,Account Length,Phone,Eve Charge,Night Calls
0,128,382-4657,16.78,91
1,107,371-7191,16.62,103
2,137,358-1921,10.3,104
3,84,375-9999,5.26,89
4,75,330-6626,12.61,121


In [32]:
desired_columns = ["Account Length", "VMail Message", "Day Calls"]
desired_columns

['Account Length', 'VMail Message', 'Day Calls']

In [33]:
all_columns_list = data.columns.values.tolist()
all_columns_list

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [34]:
sublist = [x for x in all_columns_list if x not in desired_columns] # Keep every column that is NOT in desired_columns
sublist

['State',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'Day Mins',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [35]:
subset = data[sublist]
subset.head(5)

Unnamed: 0,State,Area Code,Phone,Int'l Plan,VMail Plan,Day Mins,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,415,382-4657,no,yes,265.1,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,415,371-7191,no,yes,161.6,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,415,358-1921,no,no,243.4,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,408,375-9999,yes,no,299.4,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,415,330-6626,yes,no,166.7,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


In [36]:
# Another method: set subtraction (note that a set does not preserve the column order)
a = set(desired_columns)
b = set(all_columns_list)
sublist2 = b-a
sublist2 = list(sublist2)
subset2 = data[sublist2]
subset2.head(5)

Unnamed: 0,Eve Mins,Eve Charge,Day Mins,Night Mins,Int'l Plan,Intl Calls,Night Calls,State,Night Charge,VMail Plan,Area Code,Intl Mins,CustServ Calls,Eve Calls,Phone,Churn?,Intl Charge,Day Charge
0,197.4,16.78,265.1,244.7,no,3,91,KS,11.01,yes,415,10.0,1,99,382-4657,False.,2.7,45.07
1,195.5,16.62,161.6,254.4,no,3,103,OH,11.45,yes,415,13.7,1,103,371-7191,False.,3.7,27.47
2,121.2,10.3,243.4,162.6,no,5,104,NJ,7.32,no,415,12.2,0,110,358-1921,False.,3.29,41.38
3,61.9,5.26,299.4,196.9,yes,7,89,OH,8.86,no,408,6.6,2,88,375-9999,False.,1.78,50.9
4,148.3,12.61,166.7,186.9,yes,3,121,OK,8.41,no,415,10.1,3,122,330-6626,False.,2.73,28.34
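The set-based result above comes back with its columns in arbitrary order, because Python sets are unordered. When the original order matters, one alternative is `DataFrame.drop`, which returns a new frame with the remaining columns in their original order. A minimal sketch on a toy frame (the frame and the name `unwanted` are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy frame; the column names mirror a few from the churn dataset.
df = pd.DataFrame({
    "State": ["KS", "OH"],
    "Account Length": [128, 107],
    "Day Calls": [110, 123],
    "Phone": ["382-4657", "371-7191"],
})
unwanted = ["Account Length", "Day Calls"]

# drop() returns a new frame; the remaining columns keep their original order.
kept = df.drop(columns=unwanted)
print(kept.columns.tolist())
```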


In [37]:
data[10:15]

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
10,IN,65,415,329-6603,no,no,0,129.1,137,21.95,...,83,19.42,208.8,111,9.4,12.7,6,3.43,4,True.
11,RI,74,415,344-9403,no,no,0,187.7,127,31.91,...,148,13.89,196.0,94,8.82,9.1,5,2.46,0,False.
12,IA,168,408,363-1107,no,no,0,128.8,96,21.9,...,71,8.92,141.1,128,6.35,11.2,2,3.02,1,False.
13,MT,95,510,394-8006,no,no,0,156.6,88,26.62,...,75,21.05,192.3,115,8.65,12.3,5,3.32,3,False.
14,IA,62,415,366-9238,no,no,0,120.7,70,20.52,...,76,26.11,203.0,99,9.14,13.1,6,3.54,4,False.


In [38]:
#Get Users with Day Mins > 300
data1 = data[data["Day Mins"] > 300]
data1.shape

(43, 21)

In [39]:
#Get users from NY
data2 = data[data["State"] == "NY"]
data2.shape

(83, 21)

In [40]:
#With operator AND -> &
data3 = data[(data["Day Mins"] > 300) & (data["State"] == "NY")]
data3.shape

(2, 21)

In [41]:
#With Operator OR -> |
data4 = data[(data["Day Mins"] > 300) | (data["State"] == "NY")]
data4.shape

(124, 21)
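Each condition in these filters is a boolean `Series` with one value per row, and `&`, `|` and `~` combine such masks element by element. The parentheses around each comparison are required because `&` and `|` bind more tightly than `>` and `==` in Python. A small sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["NY", "NY", "OH", "KS"],
    "Day Mins": [310.0, 120.0, 305.5, 90.0],
})

heavy = df["Day Mins"] > 300   # boolean Series, one value per row
in_ny = df["State"] == "NY"

both = df[heavy & in_ny]         # rows satisfying both conditions
either = df[heavy | in_ny]       # rows satisfying at least one condition
neither = df[~(heavy | in_ny)]   # ~ negates a mask
```

Naming the masks first, as here, is a readability choice; writing the comparisons inline as in the cells above is equivalent.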

In [42]:
data5 = data[data["Day Calls"] < data["Night Calls"]]
data5.shape

(1626, 21)

In [43]:
data6 = data[data["Day Mins"] < data["Night Mins"]]
data6.shape

(2051, 21)

In [44]:
# Get Day Mins, Night Mins and Account Length for the first 50 users
subset_first_50 = data[["Day Mins", "Night Mins", "Account Length"]][:50]
subset_first_50.shape

(50, 3)

In [45]:
# iloc selects rows and columns by integer position
data.iloc[1:10, 3:6] # Rows first, then columns

Unnamed: 0,Phone,Int'l Plan,VMail Plan
1,371-7191,no,yes
2,358-1921,no,no
3,375-9999,yes,no
4,330-6626,yes,no
5,391-8027,yes,no
6,355-9993,no,yes
7,329-9001,yes,no
8,335-4719,no,no
9,330-8173,yes,yes


In [46]:
# Filtering with iloc
data.iloc[[1,2,3,5,8], [1,4,6]]

Unnamed: 0,Account Length,Int'l Plan,VMail Message
1,107,no,26
2,137,no,0
3,84,yes,0
5,118,yes,0
8,117,no,0


In [47]:
data.iloc[:, 1:4] # Every row, columns at positions 1 to 3 (the end position, 4, is excluded)

Unnamed: 0,Account Length,Area Code,Phone
0,128,415,382-4657
1,107,415,371-7191
2,137,415,358-1921
3,84,408,375-9999
4,75,415,330-6626
...,...,...,...
3328,192,415,414-4276
3329,68,415,370-3271
3330,28,510,328-8230
3331,184,510,364-6381


In [48]:
data.iloc[1:5, :] # All columns, rows at positions 1 to 4 (the end position, 5, is excluded)

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


In [49]:
# loc selects rows and columns by label; unlike iloc, the end of the slice (10) is included
data.loc[1:10, ["Area Code", "Day Mins", "VMail Plan"]]

Unnamed: 0,Area Code,Day Mins,VMail Plan
1,415,161.6,yes
2,415,243.4,no
3,408,299.4,no
4,415,166.7,no
5,510,223.4,no
6,510,218.2,yes
7,415,157.0,no
8,408,184.5,no
9,415,258.6,yes
10,415,129.1,no
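The outputs above show a difference that is easy to miss: `iloc` slices exclude the end position, as Python slices do, while `loc` slices include the end label. A minimal sketch on a toy frame with the default integer index:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10, 20)})  # default integer index 0..9

by_position = df.iloc[1:5]  # positions 1, 2, 3, 4 (end excluded)
by_label = df.loc[1:5]      # labels 1, 2, 3, 4, 5 (end included)

print(len(by_position), len(by_label))
```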


In [50]:
data["Total Mins"] = data["Day Mins"] + data["Night Mins"] + data["Eve Mins"]
data["Total Mins"].head(5)

0    707.2
1    611.5
2    527.2
3    558.2
4    501.9
Name: Total Mins, dtype: float64

In [51]:
data["Total Calls"] = data["Day Calls"] + data["Night Calls"] + data["Eve Calls"]
data["Total Calls"].head(5)

0    300
1    329
2    328
3    248
4    356
Name: Total Calls, dtype: int64

In [52]:
data.shape

(3333, 23)

In [53]:
data.head(5)

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?,Total Mins,Total Calls
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,244.7,91,11.01,10.0,3,2.7,1,False.,707.2,300
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,254.4,103,11.45,13.7,3,3.7,1,False.,611.5,329
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,162.6,104,7.32,12.2,5,3.29,0,False.,527.2,328
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,196.9,89,8.86,6.6,7,1.78,2,False.,558.2,248
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,186.9,121,8.41,10.1,3,2.73,3,False.,501.9,356
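Assigning to `data["Total Mins"]` modifies the frame in place. When the original frame should stay untouched, one option is `DataFrame.assign`, which returns a new frame. The sketch below also sums a list of columns with `sum(axis=1)` instead of chaining `+`; the toy values are the first two rows shown above, and the name `minute_columns` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Day Mins": [265.1, 161.6],
    "Eve Mins": [197.4, 195.5],
    "Night Mins": [244.7, 254.4],
})
minute_columns = ["Day Mins", "Eve Mins", "Night Mins"]

# assign() returns a new frame and leaves df itself unchanged.
# The ** unpacking is needed because the column name contains a space.
with_total = df.assign(**{"Total Mins": df[minute_columns].sum(axis=1)})
```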


## Generating Random Numbers

In [54]:
import numpy as np

In [55]:
np.random.randint(1, 100)

84

In [56]:
# The most common case: a random float in the half-open interval [0, 1)
np.random.random()

0.2825726541438438

In [57]:
# Function that generates a list of n random integers in the half-open interval [a, b): a is included, b is excluded
def randint_list(n, a, b):
    x = []
    for i in range(n):
        x.append(np.random.randint(a, b))
    return x

In [58]:
randint_list(10, 1, 50)

[40, 21, 33, 34, 25, 26, 37, 8, 12, 41]
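The loop can also be replaced by a single call, since `np.random.randint` accepts a `size` argument and returns an array of that length. The function name below is hypothetical, chosen to sit next to `randint_list`:

```python
import numpy as np

def randint_list_vectorized(n, a, b):
    """Return n random integers drawn from the half-open interval [a, b)."""
    return np.random.randint(a, b, size=n).tolist()

values = randint_list_vectorized(10, 1, 50)
print(values)
```

As with the loop version, the upper bound `b` is never returned.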

In [59]:
import random

In [60]:
random.randrange(1, 100, 7) # a value from 1, 8, 15, ..., 99 (step of 7 starting at 1); use randrange(0, 100, 7) for multiples of 7

22
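The values that `random.randrange(start, stop, step)` can return are exactly those of `range(start, stop, step)`, so listing the range is a quick way to check what a call can produce. The variable names in this sketch are illustrative:

```python
import random

# Every value randrange(1, 100, 7) can return: start at 1, step by 7.
candidates = list(range(1, 100, 7))
print(candidates[:4])

# For actual multiples of 7, start the range at 0 (or 7) instead.
multiples_of_7 = list(range(0, 100, 7))
draw = random.randrange(0, 100, 7)
```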

#### Shuffling

In [61]:
a = np.arange(100)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [62]:
np.random.shuffle(a)
a

array([88, 61,  7, 11, 84,  3, 52, 19, 25, 82, 13, 59, 73, 89, 98, 42, 17,
       22, 71, 75, 87, 21, 15, 43, 45, 85,  8, 18, 67, 57, 28,  2, 81, 14,
       33, 30, 50, 34, 56, 48, 63, 47, 23, 55, 77,  6,  0, 32, 10, 27,  9,
       20, 93, 16, 99, 53,  4,  1, 29, 37, 26, 60, 83, 36, 51, 46, 68, 65,
       38, 58, 39, 96,  5, 70, 66, 97, 62, 35, 92, 44, 24, 78, 90, 40, 76,
       94, 79, 86, 74, 91, 49, 12, 95, 41, 31, 72, 80, 64, 69, 54])
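Note that `np.random.shuffle` reorders the array in place and returns `None`. When the original order must be kept, `np.random.permutation` returns a shuffled copy instead. A minimal sketch:

```python
import numpy as np

a = np.arange(10)

# permutation() returns a new shuffled array; a itself is unchanged.
shuffled_copy = np.random.permutation(a)

print(a)
print(shuffled_copy)
```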

In [63]:
column_list = data.columns.values.tolist()
column_list

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?',
 'Total Mins',
 'Total Calls']

In [64]:
np.random.choice(column_list)

'VMail Plan'

#### Seed

In [65]:
np.random.seed(2018)
for i in range(5):
    print(np.random.random())

0.8823493117539459
0.10432773786047767
0.9070093335163405
0.3063988986063515
0.446408872427422
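Setting the seed makes the sequence reproducible: seeding again with the same value restarts the same sequence. The helper `draw_five` below is hypothetical and exists only to show this:

```python
import numpy as np

def draw_five(seed):
    """Seed the global NumPy generator, then draw five floats from [0, 1)."""
    np.random.seed(seed)
    return [np.random.random() for _ in range(5)]

first = draw_five(2018)
second = draw_five(2018)

print(first == second)
```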
