### Data Acquisition Exercises

In [1]:
import os

import pandas as pd
import seaborn as sns

from pydataset import data

from env import get_db_url

4. In a Jupyter notebook, `classification_exercises.ipynb`, use a Python module containing datasets (`pydataset` or seaborn's datasets) as the source for the iris data. Create a pandas dataframe, `df_iris`, from this data.

    - print the first 3 rows
    - print the number of rows and columns (shape)
    - print the column names
    - print the data type of each column
    - print the summary statistics for each of the numeric variables. Would you
      recommend rescaling the data based on these statistics?

In [2]:
df_iris = data('iris')
df_iris.head(3)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa


In [3]:
df_iris.shape

(150, 5)

In [4]:
list(df_iris.columns)

['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

In [5]:
df_iris.dtypes

Sepal.Length    float64
Sepal.Width     float64
Petal.Length    float64
Petal.Width     float64
Species          object
dtype: object

In [6]:
df_iris.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Sepal.Length,150.0,5.843333,0.828066,4.3,5.1,5.8,6.4,7.9
Sepal.Width,150.0,3.057333,0.435866,2.0,2.8,3.0,3.3,4.4
Petal.Length,150.0,3.758,1.765298,1.0,1.6,4.35,5.1,6.9
Petal.Width,150.0,1.199333,0.762238,0.1,0.3,1.3,1.8,2.5


5. Read the data from [this google sheet](https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit?usp=sharing) into a dataframe, `df_google`

    - print the first 3 rows
    - print the number of rows and columns
    - print the column names
    - print the data type of each column
    - print the summary statistics for each of the numeric variables
    - print the unique values for each of your categorical variables

In [7]:
sheet_url = \
'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'
sheet_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
df_google = pd.read_csv(sheet_url)
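
The URL rewrite in the cell above can be wrapped in a small helper. This is only a sketch: the name `sheet_csv_url` is hypothetical, and it assumes the link has the same `/edit#gid=` form used in this notebook.

```python
def sheet_csv_url(edit_url: str) -> str:
    """Turn a Google Sheets '/edit#gid=<id>' link into a CSV export link."""
    marker = '/edit#gid='
    if marker not in edit_url:
        # Fail loudly instead of silently returning an unusable URL.
        raise ValueError(f'expected {marker!r} in the URL')
    return edit_url.replace(marker, '/export?format=csv&gid=')


edit_url = ('https://docs.google.com/spreadsheets/d/'
            '1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357')
csv_url = sheet_csv_url(edit_url)
# pd.read_csv(csv_url) would then download the sheet (needs network access).
```

Raising on an unexpected URL shape makes a copy-paste mistake visible before `pd.read_csv` tries to parse an HTML page.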

In [8]:
df_google.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [9]:
df_google.shape

(891, 12)

In [10]:
list(df_google.columns)

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [11]:
df_google.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [12]:
df_google.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [13]:
df_google.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PassengerId,891.0,446.0,257.353842,1.0,223.5,446.0,668.5,891.0
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Age,714.0,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292


In [14]:
#print the unique values for each of your categorical variables
#list(df_google.Name.unique())
df_google.Name.nunique()

891

In [15]:
list(df_google.Sex.unique())

['male', 'female']

In [16]:
df_google.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [17]:
#df_google.Ticket.unique()
df_google.Ticket.nunique()

681

In [18]:
#df_google.Cabin.unique()
df_google.Cabin.nunique()

147

In [19]:
list(df_google.Embarked.unique())

['S', 'C', 'Q', nan]

In [20]:
df_google.Embarked.value_counts(dropna = False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
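
The column-by-column checks above could also be done in one loop. The sketch below is one possible approach, not the class solution: the helper name `summarize_categoricals`, the `max_unique` cutoff, and the tiny stand-in dataframe are all hypothetical.

```python
import pandas as pd

# Tiny stand-in frame (made-up values) so the sketch runs without the sheet.
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', None, 'S'],
    'Ticket': ['A1', 'B2', 'C3', 'D4'],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})


def summarize_categoricals(df, max_unique=3):
    """Map each non-numeric column to its value counts, or to just the
    number of distinct values when there are more than max_unique."""
    summary = {}
    for col in df.select_dtypes(exclude='number').columns:
        n_unique = df[col].nunique()
        if n_unique <= max_unique:
            # dropna=False keeps missing values visible in the counts.
            summary[col] = df[col].value_counts(dropna=False).to_dict()
        else:
            summary[col] = n_unique
    return summary


summary = summarize_categoricals(df)
```

Columns such as `Name` or `Ticket` are nearly unique per row, so reporting only their distinct-value count keeps the output readable.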

6. Download the previous exercise's file as an Excel file (File → Download → Microsoft Excel). Read the downloaded file into a dataframe named `df_excel`.

    - assign the first 100 rows to a new dataframe, `df_excel_sample`
    - print the number of rows of your original dataframe
    - print the first 5 column names
    - print the column names that have a data type of `object`
    - compute the range for each of the numeric variables.
    

In [21]:
df_excel = pd.read_excel('train.xlsx')
df_excel.head(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1.0,0.0,3.0,"Braund, Mr. Owen Harris",male,22.0,1.0,0.0,A/5 21171,7.25,,S


In [22]:
df_excel_sample = df_excel[:100]
df_excel_sample.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1.0,0.0,3.0,"Braund, Mr. Owen Harris",male,22.0,1.0,0.0,A/5 21171,7.25,,S
1,2.0,1.0,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1.0,0.0,PC 17599,71.2833,C85,C
2,3.0,1.0,3.0,"Heikkinen, Miss. Laina",female,26.0,0.0,0.0,STON/O2. 3101282,7.925,,S
3,4.0,1.0,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1.0,0.0,113803.0,53.1,C123,S
4,5.0,0.0,3.0,"Allen, Mr. William Henry",male,35.0,0.0,0.0,373450.0,8.05,,S


In [23]:
df_excel.index.size

891

In [24]:
df_excel_sample.index.size

100

In [25]:
df_excel.columns[:5]

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex'], dtype='object')

In [26]:
#print the column names that have a data type of object
dtypes_excel = df_excel.dtypes.reset_index()
dtypes_excel

Unnamed: 0,index,0
0,PassengerId,float64
1,Survived,float64
2,Pclass,float64
3,Name,object
4,Sex,object
5,Age,float64
6,SibSp,float64
7,Parch,float64
8,Ticket,object
9,Fare,float64


In [27]:
list(dtypes_excel[dtypes_excel[0] == 'object']['index'])

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [28]:
#print the column names that have a data type of object
#class solution
df_excel.select_dtypes(include='object').head()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803.0,C123,S
4,"Allen, Mr. William Henry",male,373450.0,,S


In [29]:
df_excel.select_dtypes(include='object').columns.to_list()

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [30]:
df_excel.Fare.dtype

dtype('float64')

In [31]:
#compute the range for each of the numeric variables.

print('{:<20}|{:>7}'.format('Variable', 'Range'))
print('__________________________\n')
for col in df_excel.columns:
    if df_excel[col].dtype != 'O':
        col_series = df_excel[col]
        #print(f'Range of values in {col} is {col_series.max() - col_series.min()}')
        print('{:<20}|{:>7}'.format(col, round(col_series.max() - col_series.min(), 2)))

Variable            |  Range
__________________________

PassengerId         |  890.0
Survived            |    1.0
Pclass              |    2.0
Age                 |  79.58
SibSp               |    8.0
Parch               |    6.0
Fare                | 512.33
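
A vectorized alternative to the loop above, shown as a sketch on a small made-up dataframe rather than `train.xlsx`: select the numeric columns once and subtract the column minimums from the maximums.

```python
import pandas as pd

# Made-up rows standing in for the Excel data.
df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Fare': [7.25, 71.2833, 7.925],
    'Sex': ['male', 'female', 'female'],
})

# Keep only numeric columns, then compute max - min per column.
numeric = df.select_dtypes(include='number')
ranges = (numeric.max() - numeric.min()).round(2)
```

`ranges` is a Series indexed by column name, so it can be printed directly or formatted like the table above.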


Make a new Python module, `acquire.py`, to hold the following data acquisition functions:

7. Make a function named `get_titanic_data` that returns the titanic data from the codeup data science database as a pandas data frame. Obtain your data from the _Codeup Data Science Database_. 


8. Make a function named `get_iris_data` that returns the data from the `iris_db` on the codeup data science database as a pandas data frame. The returned data frame should include the actual name of the species in addition to the `species_id`s. Obtain your data from the _Codeup Data Science Database_. 

9. Make a function named `get_telco_data` that returns the data from the `telco_churn` database in SQL. In your SQL, be sure to join all 4 tables together, so that the resulting dataframe contains all the contract, payment, and internet service options. Obtain your data from the _Codeup Data Science Database_. 

10. Once you've got your `get_titanic_data`, `get_iris_data`, and `get_telco_data` functions written, now it's time to add caching to them. To do this, edit the beginning of the function to check for the local filename of `telco.csv`, `titanic.csv`, or `iris.csv`. If they exist, use the .csv file. If the file doesn't exist, then produce the SQL and pandas necessary to create a dataframe, then write the dataframe to a .csv file with the appropriate name. 
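
The caching step in exercise 10 might look like the sketch below. The contents of `acquire.py` are not shown in this notebook, so the helper name `cached_frame` and the `fake_query` stand-in are hypothetical; in the real module the fetch step would run the SQL query against the Codeup database instead.

```python
import os
import tempfile

import pandas as pd


def cached_frame(filename, fetch):
    """Return the dataframe stored in filename; call fetch() and write the
    CSV only when the file does not exist yet."""
    if os.path.isfile(filename):
        return pd.read_csv(filename)
    df = fetch()
    df.to_csv(filename, index=False)  # index=False avoids an extra 'Unnamed: 0' column
    return df


calls = []


def fake_query():
    # Hypothetical stand-in for the database query; records each call.
    calls.append(1)
    return pd.DataFrame({'passenger_id': [0, 1], 'survived': [0, 1]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'titanic.csv')
    first = cached_frame(path, fake_query)   # runs the "query", writes the CSV
    second = cached_frame(path, fake_query)  # reads the CSV instead
```

Passing the fetch step in as a function lets the same helper serve `titanic.csv`, `iris.csv`, and `telco.csv`.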

In [32]:
import acquire as ac

In [33]:
titanic = ac.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [34]:
iris = ac.get_iris_data()
iris

Unnamed: 0,species_id,measurement_id,sepal_length,sepal_width,petal_length,petal_width,species_name
0,1,1,5.1,3.5,1.4,0.2,setosa
1,1,2,4.9,3.0,1.4,0.2,setosa
2,1,3,4.7,3.2,1.3,0.2,setosa
3,1,4,4.6,3.1,1.5,0.2,setosa
4,1,5,5.0,3.6,1.4,0.2,setosa
5,1,6,5.4,3.9,1.7,0.4,setosa
6,1,7,4.6,3.4,1.4,0.3,setosa
7,1,8,5.0,3.4,1.5,0.2,setosa
8,1,9,4.4,2.9,1.4,0.2,setosa
9,1,10,4.9,3.1,1.5,0.1,setosa


In [35]:
telco = ac.get_telco_data()
telco.head()

Unnamed: 0,payment_type_id,internet_service_type_id,contract_type_id,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,...,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,2,1,2,0002-ORFBO,Female,0,Yes,Yes,9,Yes,...,Yes,Yes,No,Yes,65.6,593.3,No,One year,DSL,Mailed check
1,2,1,1,0003-MKNFE,Male,0,No,No,9,Yes,...,No,No,Yes,No,59.9,542.4,No,Month-to-month,DSL,Mailed check
2,1,2,1,0004-TLHLJ,Male,0,No,No,4,Yes,...,No,No,No,Yes,73.9,280.85,Yes,Month-to-month,Fiber optic,Electronic check
3,1,2,1,0011-IGKFF,Male,1,Yes,No,13,Yes,...,No,Yes,Yes,Yes,98.0,1237.85,Yes,Month-to-month,Fiber optic,Electronic check
4,2,2,1,0013-EXCHZ,Female,1,Yes,No,3,Yes,...,Yes,Yes,No,Yes,83.9,267.4,Yes,Month-to-month,Fiber optic,Mailed check


In [36]:
telco.shape

(7043, 24)

## Data Preparation 

## Exercises

The end product of this exercise should be the specified functions in a python script named `prepare.py`.
Do these in your `classification_exercises.ipynb` first, then transfer to the `prepare.py` file. 

This work should all be saved in your local `classification-exercises` repo. Then add, commit, and push your changes.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# train test split from sklearn
from sklearn.model_selection import train_test_split
# imputer from sklearn
from sklearn.impute import SimpleImputer

# filter out warnings
import warnings
warnings.filterwarnings('ignore')

# our own acquire script:
import acquire as ac

**Using the Iris Dataset:**  

1. Use the function defined in `acquire.py` to load the iris data.  

1. Drop the `species_id` and `measurement_id` columns.  

1. Rename the `species_name` column to just `species`.  

1. Create dummy variables of the species name. 

1. Create a function named `prep_iris` that accepts the untransformed iris data, and returns the data with the transformations above applied.  

In [16]:
iris_df = ac.get_iris_data()

In [17]:
iris_df.head(1)

Unnamed: 0,species_id,measurement_id,sepal_length,sepal_width,petal_length,petal_width,species_name
0,1,1,5.1,3.5,1.4,0.2,setosa


In [18]:
iris_df.drop(columns = ['species_id', 'measurement_id'], inplace = True)

In [19]:
iris_df.head(1)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species_name
0,5.1,3.5,1.4,0.2,setosa


In [20]:
iris_df.rename(columns = {'species_name':'species'}, inplace = True)

In [21]:
iris_df.head(1)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa


In [22]:
#Create dummy variables of the species name.
#(df[['sex', 'class', 'embark_town']], dummy_na=False, drop_first= True)
iris_dummies = pd.get_dummies(iris_df['species'], dummy_na = False, drop_first = True)
iris_df = pd.concat([iris_df, iris_dummies] , axis = 1)
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,versicolor,virginica
0,5.1,3.5,1.4,0.2,setosa,0,0
1,4.9,3.0,1.4,0.2,setosa,0,0
2,4.7,3.2,1.3,0.2,setosa,0,0
3,4.6,3.1,1.5,0.2,setosa,0,0
4,5.0,3.6,1.4,0.2,setosa,0,0
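
To see what `drop_first = True` does in the cell above, here is a small sketch on a made-up series of species names: the first level alphabetically (`setosa`) gets no column of its own and is represented by zeros in the remaining columns.

```python
import pandas as pd

species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'], name='species')

# One indicator column per level.
all_levels = pd.get_dummies(species)

# Same encoding with the first level dropped.
reduced = pd.get_dummies(species, drop_first=True)
```

Depending on the pandas version, the indicator columns hold `uint8` values (as in the output above) or booleans; the information is the same either way.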


In [24]:
iris_df.shape

(150, 7)

In [23]:
iris_df.info() #no null values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
 5   versicolor    150 non-null    uint8  
 6   virginica     150 non-null    uint8  
dtypes: float64(4), object(1), uint8(2)
memory usage: 7.3+ KB


In [39]:
def prep_iris(df):
    '''
    Accepts the untransformed iris data and returns it with the transformations above applied.
    '''
    df = df.drop_duplicates()
    # drop 'species_id' and 'measurement_id' (reassign instead of modifying in place)
    df = df.drop(columns = ['species_id', 'measurement_id'])
    # rename 'species_name' to 'species'
    df = df.rename(columns = {'species_name':'species'})
    # get dummies for species
    dummies = pd.get_dummies(df['species'], dummy_na = False, drop_first = True)
    return pd.concat([df, dummies] , axis = 1)

In [29]:
#check if it works
check_iris = prep_iris(ac.get_iris_data())
check_iris.head(1)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,versicolor,virginica
0,5.1,3.5,1.4,0.2,setosa,0,0


**Using the Titanic Dataset:**

1. Use the function defined in acquire.py to load the Titanic data.

1. Drop any unnecessary, unhelpful, or duplicated columns.

1. Encode the categorical columns. Create dummy variables of the categorical columns and concatenate them onto the dataframe.

1. Create a function named `prep_titanic` that accepts the raw titanic data, and returns the data with the transformations above applied.

In [34]:
def prep_titanic(df):
    '''
    Takes in a titanic dataframe and returns a cleaned dataframe
    Arguments: df - a pandas dataframe with the expected feature names and columns
    Return: clean_df - a dataframe with the cleaning operations performed on it
    '''
    # Drop duplicates
    df = df.drop_duplicates()
    # Drop columns 
    columns_to_drop = ['embarked', 'pclass', 'passenger_id', 'deck']
    df = df.drop(columns = columns_to_drop)
    # Encode categorical variables; drop_first expects a single bool
    dummy_df = pd.get_dummies(df[['sex', 'class', 'embark_town']], dummy_na=False, drop_first=True)
    df = pd.concat([df, dummy_df], axis=1)
    return df

In [31]:
titanic_df = ac.get_titanic_data()
titanic_df.head(1)

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0


In [33]:
prep_titanic(titanic_df).info() #there are null values in age and embark_town

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   survived                 891 non-null    int64  
 1   sex                      891 non-null    object 
 2   age                      714 non-null    float64
 3   sibsp                    891 non-null    int64  
 4   parch                    891 non-null    int64  
 5   fare                     891 non-null    float64
 6   class                    891 non-null    object 
 7   embark_town              889 non-null    object 
 8   alone                    891 non-null    int64  
 9   sex_male                 891 non-null    uint8  
 10  class_Second             891 non-null    uint8  
 11  class_Third              891 non-null    uint8  
 12  embark_town_Queenstown   891 non-null    uint8  
 13  embark_town_Southampton  891 non-null    uint8  
dtypes: float64(2), int64(4), o
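
The null values in `age` and `embark_town` noted above are not handled by `prep_titanic`. One possible way to fill them is sketched below with plain pandas rather than the `SimpleImputer` imported earlier; the choice of the most frequent town and the median age is an assumption for illustration, and the dataframe is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same two gaps as the Titanic data.
df = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0],
    'embark_town': ['Southampton', 'Cherbourg', None, 'Southampton'],
})

# Fill the categorical column with its most frequent value.
most_common_town = df['embark_town'].mode()[0]
df['embark_town'] = df['embark_town'].fillna(most_common_town)

# Fill the numeric column with its median.
df['age'] = df['age'].fillna(df['age'].median())
```

When the data is later split, the fill values are usually computed on the train set only and then applied to validate and test, so that information from those sets does not leak into training.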

**Using the Telco Dataset:**

1. Use the function defined in `acquire.py` to load the Telco data.

1. Drop any unnecessary, unhelpful, or duplicated columns. This could mean dropping foreign key columns but keeping the corresponding string values, for example.

1. Encode the categorical columns. Create dummy variables of the categorical columns and concatenate them onto the dataframe.

1. Create a function named `prep_telco` that accepts the raw telco data, and returns the data with the transformations above applied.

In [40]:
telco_df = ac.get_telco_data()
telco_df.head(1)

Unnamed: 0,payment_type_id,internet_service_type_id,contract_type_id,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,...,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,2,1,2,0002-ORFBO,Female,0,Yes,Yes,9,Yes,...,Yes,Yes,No,Yes,65.6,593.3,No,One year,DSL,Mailed check


In [41]:
telco_df.drop_duplicates(inplace = True)

In [45]:
telco_df.drop(
    columns = ['customer_id', 'contract_type_id', 'internet_service_type_id', 'payment_type_id'], 
    inplace = True)

In [46]:
telco_df.shape

(7043, 20)

### Create dummies for Telco

In [47]:
telco_df.info() #no null values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   gender                 7043 non-null   object 
 1   senior_citizen         7043 non-null   int64  
 2   partner                7043 non-null   object 
 3   dependents             7043 non-null   object 
 4   tenure                 7043 non-null   int64  
 5   phone_service          7043 non-null   object 
 6   multiple_lines         7043 non-null   object 
 7   online_security        7043 non-null   object 
 8   online_backup          7043 non-null   object 
 9   device_protection      7043 non-null   object 
 10  tech_support           7043 non-null   object 
 11  streaming_tv           7043 non-null   object 
 12  streaming_movies       7043 non-null   object 
 13  paperless_billing      7043 non-null   object 
 14  monthly_charges        7043 non-null   float64
 15  tota

In [48]:
telco_df.head()

Unnamed: 0,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,Female,0,Yes,Yes,9,Yes,No,No,Yes,No,Yes,Yes,No,Yes,65.6,593.3,No,One year,DSL,Mailed check
1,Male,0,No,No,9,Yes,Yes,No,No,No,No,No,Yes,No,59.9,542.4,No,Month-to-month,DSL,Mailed check
2,Male,0,No,No,4,Yes,No,No,No,Yes,No,No,No,Yes,73.9,280.85,Yes,Month-to-month,Fiber optic,Electronic check
3,Male,1,Yes,No,13,Yes,No,No,Yes,Yes,No,Yes,Yes,Yes,98.0,1237.85,Yes,Month-to-month,Fiber optic,Electronic check
4,Female,1,Yes,No,3,Yes,No,No,No,No,Yes,Yes,No,Yes,83.9,267.4,Yes,Month-to-month,Fiber optic,Mailed check


In [82]:
# Fix total charges
'''
telco_df[~(telco_df.total_charges.str.isdigit())]
#telco_df.total_charges.astype(float) returns - 'couldn't convert '' to float
telco_df[telco_df.total_charges == '']
'''

In [68]:
telco_df.total_charges = telco_df.total_charges.replace(' ', np.nan).astype(float)

In [69]:
telco_df = telco_df.dropna()
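
An alternative to the `replace` and `astype` combination above is `pd.to_numeric` with `errors='coerce'`, sketched here on a made-up series. Note the difference: it turns every value that cannot be parsed into `NaN`, not only the single-space strings.

```python
import pandas as pd

# Made-up values; ' ' mimics the blank total_charges entries.
total_charges = pd.Series(['593.3', ' ', '280.85'])

# Unparseable strings become NaN instead of raising an error.
as_float = pd.to_numeric(total_charges, errors='coerce')

# Drop the rows that could not be converted.
cleaned = as_float.dropna()
```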

In [83]:
#telco_df.info()

In [72]:
#fix senior_citizen: cast it to str so it is treated as a categorical column
#this isn't strictly needed because the column is already 0s and 1s,
#so I won't add this step to the function
telco_df.senior_citizen = telco_df.senior_citizen.astype(str)

In [80]:
col_dummies2 = [] #for 2 values
col_dummies = []
for col in telco_df.columns:
    if telco_df[col].dtype == 'O' and telco_df[col].nunique() == 2:
        col_dummies2.append(col)
    elif telco_df[col].dtype == 'O' and telco_df[col].nunique() > 2:
        col_dummies.append(col)

In [86]:
#dummies for columns with 2 values
telco_dummies2 = pd.get_dummies(telco_df[col_dummies2], dummy_na=False, drop_first= True)

#dummies for columns with more than 2 values; drop_first = False keeps a column for every category
telco_dummies = pd.get_dummies(telco_df[col_dummies], dummy_na=False, drop_first= False)

#concat
telco_df = pd.concat([telco_df, telco_dummies2, telco_dummies], axis = 1)

Create a function named `prep_telco` that accepts the raw telco data, and returns the data with the transformations above applied.

In [92]:
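def prep_telco(df):
    '''
    Accepts the raw telco data and returns it with the transformations above applied.
    '''
    df = df.drop_duplicates()
    df = df.drop(
    columns = ['customer_id', 'contract_type_id', 'internet_service_type_id', 'payment_type_id'])
    # blank total_charges values become NaN so the column can be cast to float
    df['total_charges'] = df.total_charges.replace(' ', np.nan).astype(float)
    df = df.dropna()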
    
    #columns for dummies
    col_dummies2 = [] #for 2 values
    col_dummies = []
    for col in df.columns:
        if df[col].dtype == 'O' and df[col].nunique() == 2:
            col_dummies2.append(col)
        elif df[col].dtype == 'O' and df[col].nunique() > 2:
            col_dummies.append(col)
            
    #create dummies
    #dummies for columns with 2 values
    telco_dummies2 = pd.get_dummies(df[col_dummies2], dummy_na=False, drop_first= True)

    #dummies for columns with more than 2 values; drop_first = False keeps a column for every category
    telco_dummies = pd.get_dummies(df[col_dummies], dummy_na=False, drop_first= False)
    
    #concat and return
    return pd.concat([df, telco_dummies2, telco_dummies], axis = 1)
    

In [117]:
prep_telco(ac.get_telco_data()).shape

(7032, 57)

`senior_citizen` is already 0s and 1s, so we can leave it as is.

**Split your data**

1. Write a function to split your data into `train`, `validate`, and `test` datasets. Add this function to `prepare.py`.

1. Run the function in your notebook on the Iris dataset, returning 3 datasets: `train_iris`, `validate_iris`, and `test_iris`.

1. Run the function on the Titanic dataset, returning 3 datasets: `train_titanic`, `validate_titanic`, and `test_titanic`.

1. Run the function on the Telco dataset, returning 3 datasets: `train_telco`, `validate_telco`, and `test_telco`.

In [118]:
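def split_db(df, target):
    '''
    Splits df into train, validate, and test sets, stratified on the target column.
    The first split holds out 20% as test; the second keeps 70% of the remainder as train.
    '''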
    train, test = train_test_split(df,
                               train_size = 0.8,
                               stratify = df[target])
    train, validate = train_test_split(train,
                                  train_size = 0.7,
                                  stratify = train[target])
    
    return train, validate, test
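
The two nested `train_size` values determine the overall share of each set. The arithmetic is sketched below; the helper name `split_fractions` is hypothetical and only multiplies the proportions, it does not split any data.

```python
def split_fractions(first_train_size=0.8, second_train_size=0.7):
    """Return the (train, validate, test) shares produced by two nested splits."""
    test = 1 - first_train_size
    train = first_train_size * second_train_size
    validate = first_train_size * (1 - second_train_size)
    return train, validate, test


train_share, validate_share, test_share = split_fractions()
# With 0.8 and 0.7 this gives roughly 56% train, 24% validate, 20% test.
```

Because `split_db` does not pass `random_state` to `train_test_split`, each run produces a different split; supplying a fixed `random_state` in both calls would make the result reproducible.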

In [135]:
train_iris, validate_iris, test_iris = split_db(prep_iris(ac.get_iris_data()), 'species')

In [136]:
train_titanic, validate_titanic, test_titanic = split_db(prep_titanic(ac.get_titanic_data()), 'survived')

In [137]:
train_telco, validate_telco, test_telco = split_db(prep_telco(ac.get_telco_data()), 'churn')