# Data Wrangling

<p>Once data has been wrangled, downstream work may include further <a href="/wiki/Mung_(computer_term)" title="Mung (computer term)">munging</a>, <a href="/wiki/Data_visualization" title="Data visualization">data visualization</a>, data aggregation, and training a <a href="/wiki/Statistical_model" title="Statistical model">statistical model</a>, among many other potential uses. As a process, data munging typically follows a set of general steps: extracting the data in raw form from the data source, "munging" the raw data using algorithms (e.g. sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.
</p>
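
The three steps described above (extract, munge, deposit) can be sketched on a tiny in-memory example. This is only an illustration under assumptions: the CSV text, the column names `customer` and `day_mins`, and the in-memory sink are all made up and are not part of the churn dataset used below.

```python
import io

import pandas as pd

# Hypothetical raw source: a small CSV held in memory instead of a real file.
raw_source = io.StringIO(
    "customer,day_mins\n"
    "carol,120.5\n"
    "alice,301.2\n"
    "bob,87.0\n"
)

# 1. Extract: read the raw data from the source.
raw = pd.read_csv(raw_source)

# 2. Munge: the data is now parsed into a DataFrame; sort it by usage.
munged = raw.sort_values("day_mins", ascending=False).reset_index(drop=True)

# 3. Deposit: write the result to a sink (an in-memory buffer here).
sink = io.StringIO()
munged.to_csv(sink, index=False)

print(munged)
```

In a real workflow the source and the sink would usually be files or databases; the buffers here only keep the sketch self-contained.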

In [None]:
import pandas as pd

In [None]:
from pathlib import Path

# Path to the dataset, relative to the current working directory
customer_churn_dataset = Path.cwd() / 'data' / 'customer-churn-model' / 'Customer Churn Model.txt'

In [None]:
data = pd.read_csv(customer_churn_dataset)

In [None]:
data.head()

### Create a subset of data

#### Subset of a single Series

In [None]:
account_length = data["Account Length"]

In [None]:
account_length.head()

In [None]:
type(account_length)

In [None]:
subset = data[["Account Length", "Phone", "Eve Charge", "Day Calls"]]

In [None]:
subset.head()

In [None]:
type(subset)
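
The two `type(...)` calls above differ because of the brackets used for selection. A minimal sketch on a toy frame (the values are made up; only the column names mirror the churn dataset):

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Account Length": [128, 107, 137],
    "Day Calls": [110, 123, 114],
})

as_series = toy["Account Length"]    # single label -> Series
as_frame = toy[["Account Length"]]   # list of labels -> DataFrame

print(type(as_series).__name__)         # Series
print(type(as_frame).__name__)          # DataFrame
print(as_series.shape, as_frame.shape)  # (3,) (3, 1)
```

Passing a list, even one with a single label, keeps the two-dimensional structure.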

In [None]:
desired_columns = ["Account Length", "Phone", "Eve Charge", "Night Calls"]
subset = data[desired_columns]
subset.head()

In [None]:
desired_columns = ["Account Length", "VMail Message", "Day Calls"]
desired_columns

In [None]:
all_columns_list = data.columns.tolist()
all_columns_list

In [None]:
sublist = [x for x in all_columns_list if x not in desired_columns]
sublist

In [None]:
subset = data[sublist]
subset.head()
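
The cells above keep every column that is *not* in a given list. As a possible alternative, `DataFrame.drop(columns=...)` gives the same result in one call. The sketch below uses a toy frame with invented values, and `unwanted` is a hypothetical name for the list of columns to exclude:

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Account Length": [128, 107],
    "VMail Message": [25, 26],
    "Day Calls": [110, 123],
    "State": ["KS", "OH"],
})

unwanted = ["Account Length", "VMail Message", "Day Calls"]

# List-comprehension approach, as in the cells above.
kept = [c for c in toy.columns if c not in unwanted]
via_list = toy[kept]

# Alternative: drop returns a new frame and leaves toy untouched.
via_drop = toy.drop(columns=unwanted)

print(via_drop.columns.tolist())  # ['State']
```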

#### Subset of Rows - Slicing

Selecting a contiguous range of rows from a DataFrame is called slicing. As with Python lists, positions start at 0 and the end position is excluded.

In [None]:
data[1:25]

In [None]:
data[10:35]

In [None]:
data[:8] # equivalent to data[0:8]

In [None]:
data[3320:]
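
Row slices follow the usual Python rule: the start position is included and the end position is not. A small sketch on a five-row toy frame (made-up values) makes the boundaries easy to check:

```python
import pandas as pd

# Toy frame with five rows; values are invented for illustration.
toy = pd.DataFrame({"Day Mins": [10.0, 20.0, 30.0, 40.0, 50.0]})

first_three = toy[:3]   # same as toy[0:3]: positions 0, 1, 2
middle = toy[1:3]       # positions 1 and 2; the end position is excluded
tail = toy[3:]          # position 3 through the last row

print(len(first_three), len(middle), len(tail))  # 3 2 2
print(middle.index.tolist())                     # [1, 2]
```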

#### Row Slicing with boolean conditions

In [None]:
# Selecting values with Day Mins > 300
data1 = data[data["Day Mins"]>300]
data1.shape

In [None]:
# Selecting values with State = "NY"
data2 = data[data["State"]=="NY"]
data2.shape

In [None]:
## AND -> &
data3 = data[(data["Day Mins"]>300) & (data["State"]=="NY")]
data3.shape

In [None]:
## OR -> |
data4 = data[(data["Day Mins"]>300) | (data["State"]=="NY")]
data4.shape
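
Each condition above is itself a boolean Series, and `&` / `|` combine those Series element by element. The following sketch shows this on a toy frame with invented values; `high_usage` and `in_ny` are names chosen for the example:

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "State": ["NY", "NY", "KS", "OH"],
    "Day Mins": [310.0, 120.0, 305.5, 90.0],
})

# Each comparison produces a boolean Series aligned with the rows.
high_usage = toy["Day Mins"] > 300
in_ny = toy["State"] == "NY"

# When the comparisons are written inline, parentheses are required
# because & and | bind more tightly than > and ==.
both = toy[high_usage & in_ny]
either = toy[high_usage | in_ny]

print(high_usage.tolist())       # [True, False, True, False]
print(both.shape, either.shape)  # (1, 2) (3, 2)
```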

In [None]:
data5 = data[data["Day Calls"]< data["Night Calls"]]
data5.shape

In [None]:
data6 = data[data["Day Mins"]<data["Night Mins"]]
data6.shape

In [None]:

subset_first_50 = data[["Day Mins", "Night Mins", "Account Length"]][:50]
subset_first_50.head()

In [None]:
subset[:10]

#### Filtering with loc and iloc (formerly ix)

In [None]:
data.iloc[1:10, 3:6]  # Rows at positions 1 to 9, columns at positions 3 to 5 (end is excluded)

In [None]:
data.iloc[:, 3:6]  # all rows, columns at positions 3 to 5
data.iloc[1:10, :]  # all columns, rows at positions 1 to 9

In [None]:
data.iloc[1:10, [2,5,7]]  # selecting specific columns

In [None]:
data.iloc[[1,5,8,36], [2,5,7]]

In [None]:
data.loc[[1,5,8,36], ["Area Code", "VMail Plan", "Day Mins"]]
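
With the default integer index, `loc` and `iloc` can look interchangeable. The difference shows up when labels and positions disagree, as in this sketch on a toy frame (values and index labels are invented):

```python
import pandas as pd

# Toy frame whose index labels deliberately differ from row positions.
toy = pd.DataFrame(
    {"Area Code": [415, 408, 510, 415], "Day Mins": [265.1, 161.6, 243.4, 299.4]},
    index=[10, 20, 30, 40],
)

by_position = toy.iloc[1:3, 0:1]          # positions 1 and 2; end excluded
by_label = toy.loc[20:30, ["Area Code"]]  # labels 20 through 30; end included

print(by_position.index.tolist())  # [20, 30]
print(by_label.index.tolist())     # [20, 30]
print(len(toy.loc[10:30]), len(toy.iloc[0:2]))  # 3 2
```

Note that label slices with `loc` include the end label, while position slices with `iloc` exclude the end position.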

## Inserting new columns in a Data Frame

In [None]:
data["Total Mins"] = data["Day Mins"] + data["Night Mins"] + data["Eve Mins"]

In [None]:
data["Total Mins"].head()

In [None]:
data["Total Calls"] = data["Day Calls"] + data["Night Calls"] + data["Eve Calls"]

In [None]:
data["Total Calls"].head()

In [None]:
data.shape

In [None]:
data.head()
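
The assignments above modify `data` in place. One possible alternative, sketched here on a toy frame with invented values, is to sum the columns row-wise and use `DataFrame.assign`, which returns a new frame and leaves the original unchanged. The name `minute_columns` is chosen for this example only.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Day Mins": [265.1, 161.6],
    "Eve Mins": [197.4, 195.5],
    "Night Mins": [244.7, 254.4],
})

minute_columns = ["Day Mins", "Eve Mins", "Night Mins"]

# sum(axis=1) adds across columns for each row.
# assign returns a new frame; toy itself keeps its three columns.
# The ** unpacking is needed because the column name contains a space.
with_total = toy.assign(**{"Total Mins": toy[minute_columns].sum(axis=1)})

print(with_total.columns.tolist())
print(toy.shape, with_total.shape)  # (2, 3) (2, 4)
```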