# Data Wrangling - Data Surgery
Data wrangling, also called data munging, is the process of transforming and mapping data from a raw dataset into another format so that it is more suitable and valuable for downstream purposes such as analysis. A data wrangler is a person who performs these transformation operations.
Downstream uses may include further munging, data visualization, data aggregation, training a statistical model, and many others. As a process, data wrangling typically follows a set of general steps: extracting the data in raw form from the data source, processing the raw data with algorithms (for example, sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a storage system (or data sink) for future use.
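
The three steps described above (extract, process or parse, deposit) can be sketched in a few lines. This is only an illustration under stated assumptions: the raw text, the column names `name` and `minutes`, and the in-memory sink are hypothetical and unrelated to the dataset loaded below.

```python
import io
import pandas as pd

# Hypothetical raw input standing in for a real data source.
raw_text = "name,minutes\ncarol, 12.5\nalice,7\nbob,not available\n"

# 1. Extract: read the raw data as-is, keeping every value as a string.
raw = pd.read_csv(io.StringIO(raw_text), dtype=str)

# 2. Process: parse values into a predefined structure, drop rows that
#    cannot be parsed, and sort the result.
clean = raw.copy()
clean["minutes"] = pd.to_numeric(clean["minutes"].str.strip(), errors="coerce")
clean = clean.dropna(subset=["minutes"]).sort_values("name").reset_index(drop=True)

# 3. Deposit: write the result to a sink (an in-memory buffer here).
sink = io.StringIO()
clean.to_csv(sink, index=False)
print(sink.getvalue())
```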

In [1]:
import pandas as pd

In [2]:
mainpath = "C:/Users/Esneider Infante/Documentos/Python Machine Learning Udemy/python-ml-course/datasets/"
filename = "customer-churn-model/Customer Churn Model.txt"
fullpath = mainpath+filename
data = pd.read_csv(fullpath)

In [3]:
data.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


## Creating a Subset of the Data

In [4]:
accountLength = data['Account Length']

In [5]:
accountLength.head()

0    128
1    107
2    137
3     84
4     75
Name: Account Length, dtype: int64

In [6]:
type(accountLength)

pandas.core.series.Series

In [7]:
subset = data[['Account Length','Phone','Eve Charge','Day Calls']]

In [8]:
subset.head()

Unnamed: 0,Account Length,Phone,Eve Charge,Day Calls
0,128,382-4657,16.78,110
1,107,371-7191,16.62,123
2,137,358-1921,10.3,114
3,84,375-9999,5.26,71
4,75,330-6626,12.61,113


In [9]:
type(subset)

pandas.core.frame.DataFrame

In [10]:
desired_columns = ['Account Length','Phone','Eve Charge','Day Calls','Night Calls']

In [11]:
subset = data[desired_columns]
subset.head()

Unnamed: 0,Account Length,Phone,Eve Charge,Day Calls,Night Calls
0,128,382-4657,16.78,110,91
1,107,371-7191,16.62,123,103
2,137,358-1921,10.3,114,104
3,84,375-9999,5.26,71,89
4,75,330-6626,12.61,113,121


In [12]:
desired_columns = ['Account Length','VMail Message','Day Calls']
all_columns_list = data.columns.values.tolist()

In [13]:
all_columns_list

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [14]:
sublist = [x for x in all_columns_list if x not in desired_columns]
sublist

['State',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'Day Mins',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

In [15]:
subset = data[sublist]
subset.head()

Unnamed: 0,State,Area Code,Phone,Int'l Plan,VMail Plan,Day Mins,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,415,382-4657,no,yes,265.1,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,415,371-7191,no,yes,161.6,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,415,358-1921,no,no,243.4,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,408,375-9999,yes,no,299.4,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,415,330-6626,yes,no,166.7,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.
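
The list comprehension above keeps every column that is not in a given list. As a possible alternative, `DataFrame.drop(columns=...)` expresses the same exclusion directly. The sketch below uses a small made-up frame (`toy`) and a hypothetical list name `unwanted`; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame reusing a few column names from the dataset; values are made up.
toy = pd.DataFrame({
    "State": ["KS", "OH"],
    "Account Length": [128, 107],
    "VMail Message": [25, 26],
    "Day Calls": [110, 123],
    "Churn?": ["False.", "False."],
})
unwanted = ["Account Length", "VMail Message", "Day Calls"]

# List-comprehension approach, as in the cell above.
kept = [c for c in toy.columns if c not in unwanted]
by_comprehension = toy[kept]

# Equivalent: let pandas drop the unwanted columns directly.
by_drop = toy.drop(columns=unwanted)

print(by_drop.columns.tolist())
```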


In [16]:
data[10:25]

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
10,IN,65,415,329-6603,no,no,0,129.1,137,21.95,...,83,19.42,208.8,111,9.4,12.7,6,3.43,4,True.
11,RI,74,415,344-9403,no,no,0,187.7,127,31.91,...,148,13.89,196.0,94,8.82,9.1,5,2.46,0,False.
12,IA,168,408,363-1107,no,no,0,128.8,96,21.9,...,71,8.92,141.1,128,6.35,11.2,2,3.02,1,False.
13,MT,95,510,394-8006,no,no,0,156.6,88,26.62,...,75,21.05,192.3,115,8.65,12.3,5,3.32,3,False.
14,IA,62,415,366-9238,no,no,0,120.7,70,20.52,...,76,26.11,203.0,99,9.14,13.1,6,3.54,4,False.
15,NY,161,415,351-7269,no,no,0,332.9,67,56.59,...,97,27.01,160.6,128,7.23,5.4,9,1.46,4,True.
16,ID,85,408,350-8884,no,yes,27,196.4,139,33.39,...,90,23.88,89.3,75,4.02,13.8,4,3.73,1,False.
17,VT,93,510,386-2923,no,no,0,190.7,114,32.42,...,111,18.55,129.6,121,5.83,8.1,3,2.19,3,False.
18,VA,76,510,356-2992,no,yes,33,189.7,66,32.25,...,65,18.09,165.7,108,7.46,10.0,5,2.7,1,False.
19,TX,73,415,373-2782,no,no,0,224.4,90,38.15,...,88,13.56,192.8,74,8.68,13.0,2,3.51,1,False.


In [17]:
data[:5]  # same as data[0:5]: rows 0 to 4 (the end of the slice is excluded)

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.
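
Row slicing follows the usual Python convention: the start defaults to 0 and the end is excluded. A minimal check on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"Day Mins": [265.1, 161.6, 243.4, 299.4, 166.7, 223.4]})

first_five = toy[:5]    # rows 0, 1, 2, 3, 4
same_thing = toy[0:5]   # identical to toy[:5]
skips_row0 = toy[1:5]   # rows 1, 2, 3, 4 only

print(len(first_five), len(skips_row0))
```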


In [18]:
## Users with Day Mins > 300
data1 = data[data['Day Mins']>300]
data1.shape

(43, 21)

In [19]:
## Users from New York (State == 'NY')
data2 = data[data['State']=='NY']
data2.shape

(83, 21)

In [20]:
## AND -> &
data3 = data[(data['State']=='NY') & (data['Day Mins']>300)]
data3.shape

(2, 21)

In [21]:
## OR -> |
data4 = data[(data['State']=='NY') | (data['Day Mins']>300)]
data4.shape

(124, 21)
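
The four row counts above are consistent with inclusion-exclusion: 83 + 43 - 2 = 124. The sketch below repeats the check on a small made-up frame (`toy` is hypothetical). Note that each condition needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>`.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["NY", "NY", "OH", "KS", "NY"],
    "Day Mins": [310.0, 120.5, 305.2, 99.0, 301.7],
})
is_ny = toy["State"] == "NY"
heavy = toy["Day Mins"] > 300

n_and = (is_ny & heavy).sum()   # rows satisfying both conditions
n_or = (is_ny | heavy).sum()    # rows satisfying at least one condition

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert n_or == is_ny.sum() + heavy.sum() - n_and

# Same identity applied to the row counts reported above.
assert 83 + 43 - 2 == 124
```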

In [22]:
## Day minutes, night minutes and account length for the first 50 customers
subset_first50 = data[['Day Mins','Night Mins','Account Length']][:50]
#subset_first50.shape
subset_first50.head()

Unnamed: 0,Day Mins,Night Mins,Account Length
0,265.1,244.7,128
1,161.6,254.4,107
2,243.4,162.6,137
3,299.4,196.9,84
4,166.7,186.9,75


In [23]:
data.iloc[1:10,3:6]  # rows 1 to 9, columns 3 to 5 (end positions are excluded)

Unnamed: 0,Phone,Int'l Plan,VMail Plan
1,371-7191,no,yes
2,358-1921,no,no
3,375-9999,yes,no
4,330-6626,yes,no
5,391-8027,yes,no
6,355-9993,no,yes
7,329-9001,yes,no
8,335-4719,no,no
9,330-8173,yes,yes


In [24]:
data.iloc[:,3:6]  # all rows, columns 3 to 5
data.iloc[1:10,:]  # all columns, rows 1 to 9
data.iloc[1:10,[2,5,7]]  # rows 1 to 9, columns 2, 5 and 7 (only this last result is displayed)

Unnamed: 0,Area Code,VMail Plan,Day Mins
1,415,yes,161.6
2,415,no,243.4
3,408,no,299.4
4,415,no,166.7
5,510,no,223.4
6,510,yes,218.2
7,415,no,157.0
8,408,no,184.5
9,415,yes,258.6


In [25]:
data.loc[[1,5,8,36],['Area Code','VMail Plan','Day Mins']]

Unnamed: 0,Area Code,VMail Plan,Day Mins
1,415,yes,161.6
5,510,no,223.4
8,408,no,184.5
36,408,yes,146.3
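
`iloc` selects by integer position and excludes the end of a slice, whereas `loc` selects by label and includes it. With the default 0..n-1 index the two are easy to confuse, so this sketch uses a made-up frame whose labels (10, 20, 30, 40) deliberately differ from the positions:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Area Code": [415, 415, 408, 510], "Day Mins": [265.1, 161.6, 243.4, 299.4]},
    index=[10, 20, 30, 40],  # labels chosen to differ from positions
)

by_position = toy.iloc[1:3]   # positions 1 and 2 (end excluded)
by_label = toy.loc[20:30]     # labels 20 through 30 (end included)

print(by_position.index.tolist())
print(by_label.index.tolist())
```

Both selections return the rows labelled 20 and 30.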


In [26]:
data['Total Mins'] = data['Day Mins']+data['Night Mins']+data['Eve Mins']

In [27]:
data['Total Mins'].head()

0    707.2
1    611.5
2    527.2
3    558.2
4    501.9
Name: Total Mins, dtype: float64

In [28]:
data['Total Calls'] = data['Day Calls']+data['Night Calls']+data['Eve Calls']

In [29]:
data['Total Calls'].head()

0    300
1    329
2    328
3    248
4    356
Name: Total Calls, dtype: int64
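
When many columns have to be added, a row-wise `sum(axis=1)` over a list of column names may be easier to maintain than chaining `+`. The sketch below uses the call counts of the first two rows shown earlier; the names `toy` and `call_columns` are hypothetical.

```python
import pandas as pd

# Call counts of rows 0 and 1 from the outputs above.
toy = pd.DataFrame({
    "Day Calls": [110, 123],
    "Eve Calls": [99, 103],
    "Night Calls": [91, 103],
})
call_columns = ["Day Calls", "Eve Calls", "Night Calls"]

# Row-wise sum across the selected columns.
toy["Total Calls"] = toy[call_columns].sum(axis=1)
print(toy["Total Calls"].tolist())  # [300, 329], matching the output above
```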

## Random Number Generation

In [31]:
import numpy as np

In [35]:
# Generate a random integer between 1 and 99 (the upper bound, 100, is excluded)
np.random.randint(1,100)

49

In [39]:
# The classic way to generate a random number: a float in [0, 1)
np.random.random()

0.03695880416981301

In [40]:
# Function that returns a list of n random integers from the half-open interval [a, b)
def randint_list(n, a, b):
    x = []
    for i in range(n):
        # np.random.randint excludes the upper bound b
        x.append(np.random.randint(a, b))
    return x


In [41]:
randint_list(10,100,5000)

[130, 4301, 4513, 3889, 4645, 4674, 2003, 212, 3441, 4285]
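
The loop above can also be replaced by a single call, since `np.random.randint` accepts a `size` argument. This is a sketch of that alternative, not a change to the function above; `randint_array` is a hypothetical name, and it returns a NumPy array rather than a list.

```python
import numpy as np

def randint_array(n, a, b):
    """Return n random integers drawn from the half-open interval [a, b)."""
    return np.random.randint(a, b, size=n)

values = randint_array(10, 100, 5000)
print(values)

# To include the upper bound, pass b + 1 as the exclusive limit.
inclusive = np.random.randint(1, 6 + 1, size=1000)  # integers from 1 to 6
```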

In [42]:
# Standard-library module with random number functions
import random

In [44]:
# Random multiples of 7 between 0 and 99
for i in range(10):
    print(random.randrange(0,100,7))

84
42
14
84
7
21
49
42
42
42
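
`random.randrange(start, stop, step)` draws from the same values that `range(start, stop, step)` would produce, so the largest possible result here is 98. A quick check (the variable names are illustrative):

```python
import random

# Every value randrange(0, 100, 7) can return.
candidates = list(range(0, 100, 7))
draws = [random.randrange(0, 100, 7) for _ in range(10)]

print(candidates[-1])                          # 98
print(all(d in candidates for d in draws))     # True
```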


#### Shuffling

In [47]:
a = np.arange(100)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [49]:
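# Shuffles in place; fine for this 1-D array, but np.random.shuffle is safer for NumPy arrays in general
random.shuffle(a)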
a

array([33, 62, 54, 19, 21, 24, 47, 81, 18, 32, 78, 34, 38, 61,  1, 36, 40,
        0,  7, 66, 43, 75, 92, 22, 93, 27, 98, 17, 30, 52, 70, 87, 65, 41,
       91, 79, 69, 71, 85, 44, 12, 59, 77, 72, 55, 88, 23, 37, 49, 10, 26,
       42, 20, 13, 29, 76, 68, 97, 64, 14, 50, 80, 94, 95,  4,  8, 11, 16,
       48, 63, 51, 15, 73, 53, 28, 31, 46, 99, 83, 25, 45, 56,  3,  2,  6,
       39, 58, 90, 35, 67, 57, 89,  9, 74, 82, 84, 96,  5, 86, 60])
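
The same idea applies to the rows of a DataFrame. Two possible approaches, shown on a made-up frame (`toy` is hypothetical), are `DataFrame.sample(frac=1)` and indexing with a random permutation of the row positions:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"id": np.arange(10)})

# Option 1: sample all rows without replacement, i.e. a shuffle.
shuffled_a = toy.sample(frac=1, random_state=0)

# Option 2: build a random ordering of positions and index with it.
order = np.random.permutation(len(toy))
shuffled_b = toy.iloc[order]

print(shuffled_a["id"].tolist())
print(shuffled_b["id"].tolist())
```

Both return a new frame and leave `toy` unchanged, unlike the in-place shuffle above.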

In [51]:
data.shape

(3333, 23)

#### Choice

In [53]:
column_list = data.columns.values.tolist()
column_list

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?',
 'Total Mins',
 'Total Calls']

In [54]:
random.choice(column_list)

'Eve Mins'
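
To pick several columns at once, the standard library also offers `random.sample` (no repeats) and `random.choices` (repeats allowed). A short sketch with a hypothetical, shortened column list:

```python
import random

columns = ["State", "Account Length", "Day Mins", "Eve Mins", "Churn?"]

one = random.choice(columns)                      # a single element
three_distinct = random.sample(columns, 3)        # three different elements
three_with_repeats = random.choices(columns, k=3) # three elements, repeats possible

print(one, three_distinct, three_with_repeats)
```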

#### Seed

In [58]:
# The seed makes the sequence reproducible: the same numbers on every run
np.random.seed(2018)
for i in range(5):
    print(np.random.random())

0.8823493117539459
0.10432773786047767
0.9070093335163405
0.3063988986063515
0.446408872427422
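
Reproducibility can be verified by resetting the seed and drawing again. The sketch also shows NumPy's newer `default_rng` generator, which keeps the seed local to one object instead of changing global state; the variable names are illustrative.

```python
import numpy as np

# Global seed: resetting it replays the same sequence.
np.random.seed(2018)
first_run = [np.random.random() for _ in range(5)]
np.random.seed(2018)
second_run = [np.random.random() for _ in range(5)]
assert first_run == second_run

# Generator objects: two generators with the same seed agree with each other.
rng_a = np.random.default_rng(2018)
rng_b = np.random.default_rng(2018)
assert rng_a.random() == rng_b.random()
```

The two APIs use different algorithms, so `default_rng(2018)` is not expected to reproduce the numbers printed above.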
