# explore_exercises.ipynb

This is my suggested path forward, but you may choose your own adventure in your explore_exercises.ipynb notebook:
- If you want a more guided exploration setup, solve the exercises below with seaborn.
- If you want more practice with the explore.py functions, use them with the Telco dataset.

Exercises
Continue in your explore_exercises.ipynb notebook. Use the iris dataset. As always, add, commit, and push your changes.

1. Split your data into train, validate, and test samples.

2. Create a swarmplot using a melted dataframe of all your numeric variables. The x-axis should be the variable name, the y-axis the measure. Add another dimension using color to represent species. Document takeaways from this visualization.

3. Create 4 subplots (2 rows x 2 columns) of scatterplots:

   - sepal_length x sepal_width
   - petal_length x petal_width
   - sepal_area x petal_area
   - sepal_length x petal_length

4. What are your takeaways? Write them down :)

5. Create a heatmap of the variables, layering the correlation coefficients on top.

6. Create a scatter matrix visualizing the interaction of each variable.

7. Is the sepal length significantly different in virginica compared to versicolor? Run a statistical experiment to test this.

8. Make sure to include a null hypothesis, alternative hypothesis, results, and summary.

9. What is your takeaway from this statistical testing?

10. Create any other visualizations and run any other statistical tests you think will be helpful in exploring this data set.
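
If a starting point helps for exercises 7 through 9, here is a minimal sketch of the experiment. It rests on a few assumptions of mine: it uses scikit-learn's bundled copy of iris (`load_iris`) instead of `acquire.get_iris_data()`, it uses Welch's t-test from SciPy, and it sets alpha to 0.05. In your notebook, run the test on your train sample, not on the full dataset.

```python
from scipy import stats
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris stands in for acquire.get_iris_data()
iris = load_iris(as_frame=True)
frame = iris.frame

# Map the integer target (0, 1, 2) to species names
species = frame["target"].map(dict(enumerate(iris.target_names)))
sepal_length = frame["sepal length (cm)"]

virginica = sepal_length[species == "virginica"]
versicolor = sepal_length[species == "versicolor"]

# H0: mean sepal length is the same for virginica and versicolor
# Ha: mean sepal length differs between virginica and versicolor
alpha = 0.05  # assumed significance level

# Welch's t-test (equal_var=False) does not assume equal variances
t_stat, p_value = stats.ttest_ind(virginica, versicolor, equal_var=False)
reject_null = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

The summary then states whether the null hypothesis was rejected at the chosen alpha and what that means for using sepal length to tell the two species apart.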


In [10]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

import acquire
import explore

In [11]:
# acquire data again
df = acquire.get_iris_data()

In [12]:
df.head()

Unnamed: 0,species_id,species_name,sepal_length,sepal_width,petal_length,petal_width,measurement_id
0,1,setosa,5.1,3.5,1.4,0.2,1
1,1,setosa,4.9,3.0,1.4,0.2,2
2,1,setosa,4.7,3.2,1.3,0.2,3
3,1,setosa,4.6,3.1,1.5,0.2,4
4,1,setosa,5.0,3.6,1.4,0.2,5


In [13]:
# Create a function named prep_iris that accepts the untransformed iris data, and returns the data with the transformations above applied.

def prep_iris(df):

    '''Prepares acquired Iris data for exploration'''
    
    # drop column using .drop(columns=column_name)
    df = df.drop(columns='species_id')
    
    # rename column using .rename(columns={current_column_name : replacement_column_name})
    df = df.rename(columns={'species_name':'species'})
    
    # create dummies dataframe using pd.get_dummies(column, drop_first=False), keeping every dummy column
    dummy_df = pd.get_dummies(df['species'], drop_first=False)
    
    # join original df with dummies df using pd.concat([original_df, dummy_df], axis=1), adding the dummy columns side by side
    df = pd.concat([df, dummy_df], axis=1)
    
    return df
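
One version note on the dummy columns, sketched on a toy Series made up for illustration (not the real data). As far as I know, pandas 2.0 and later return boolean dummy columns by default, so passing `dtype=int` is one way to get 0/1 columns.

```python
import pandas as pd

# Toy Series for illustration only; the notebook uses df['species']
species = pd.Series(
    ["setosa", "versicolor", "virginica", "setosa"], name="species"
)

# dtype=int forces 0/1 integers instead of the True/False default in newer pandas
dummies = pd.get_dummies(species, drop_first=False, dtype=int)

print(dummies)
```

Each row has exactly one 1, because `drop_first=False` keeps a column for every species.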

In [14]:
df = prep_iris(df)
df.head()

Unnamed: 0,species,sepal_length,sepal_width,petal_length,petal_width,measurement_id,setosa,versicolor,virginica
0,setosa,5.1,3.5,1.4,0.2,1,1,0,0
1,setosa,4.9,3.0,1.4,0.2,2,1,0,0
2,setosa,4.7,3.2,1.3,0.2,3,1,0,0
3,setosa,4.6,3.1,1.5,0.2,4,1,0,0
4,setosa,5.0,3.6,1.4,0.2,5,1,0,0


In [15]:
def train_validate_test_split(df, target, seed=123):
    '''
    This function takes in a dataframe, the name of the target variable
    (for stratification purposes), and an integer seed for reproducibility,
    and splits the data into train, validate and test. 
    Test is 20% of the original dataset, validate is .30*.80= 24% of the 
    original dataset, and train is .70*.80= 56% of the original dataset. 
    The function returns, in this order, train, validate and test dataframes. 
    '''
    train_validate, test = train_test_split(df, test_size=0.2, 
                                            random_state=seed, 
                                            stratify=df[target])
    train, validate = train_test_split(train_validate, test_size=0.3, 
                                       random_state=seed,
                                       stratify=train_validate[target])
    return train, validate, test
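
The percentages in the docstring come from chaining the two splits. Here is a quick sketch of that arithmetic in plain Python. `split_fractions` is a hypothetical helper for illustration, not part of explore.py, and using `round` for the row counts is a simplification, not scikit-learn's exact rounding rule.

```python
def split_fractions(test_size=0.2, validate_size=0.3):
    """Return (train, validate, test) as fractions of the original data."""
    remaining = 1 - test_size                # left over after the first split
    validate = remaining * validate_size     # .30 * .80 = 24%
    train = remaining * (1 - validate_size)  # .70 * .80 = 56%
    return train, validate, test_size


train_frac, validate_frac, test_frac = split_fractions()

n_rows = 150  # the iris dataset has 150 observations
sizes = tuple(
    round(n_rows * frac) for frac in (train_frac, validate_frac, test_frac)
)

print(sizes)
```

For 150 rows this works out to 84, 36, and 30 rows, which is what `print(train.shape, validate.shape, test.shape)` reports further down.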

In [17]:
def split_data(df):
    '''
    Take in a DataFrame and return train, validate, and test DataFrames,
    in that order, stratified on species.
    '''
    
    # splits df into train_validate and test using train_test_split() stratifying on species to get an even mix of each species
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.species)
    
    # splits train_validate into train and validate using train_test_split() stratifying on species to get an even mix of each species
    train, validate = train_test_split(train_validate, 
                                       test_size=.3, 
                                       random_state=123, 
                                       stratify=train_validate.species)
    return train, validate, test

In [19]:
train, validate, test = train_validate_test_split(df, target='species')
train.head(3)

Unnamed: 0,species,sepal_length,sepal_width,petal_length,petal_width,measurement_id,setosa,versicolor,virginica
79,versicolor,5.7,2.6,3.5,1.0,80,0,1,0
36,setosa,5.5,3.5,1.3,0.2,37,1,0,0
133,virginica,6.3,2.8,5.1,1.5,134,0,0,1
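
Exercise 3 asks for sepal_area and petal_area, which are not columns in the frame above. One reasonable reading, and this is my assumption, is length times width. `add_area_features` is a hypothetical helper, and the three rows are copied from the `train.head(3)` output so the sketch runs on its own.

```python
import pandas as pd

# Three rows copied from the train.head(3) output, for illustration
sample = pd.DataFrame(
    {
        "species": ["versicolor", "setosa", "virginica"],
        "sepal_length": [5.7, 5.5, 6.3],
        "sepal_width": [2.6, 3.5, 2.8],
        "petal_length": [3.5, 1.3, 5.1],
        "petal_width": [1.0, 0.2, 1.5],
    },
    index=[79, 36, 133],
)


def add_area_features(df):
    """Return a copy of df with sepal_area and petal_area (length * width)."""
    return df.assign(
        sepal_area=df["sepal_length"] * df["sepal_width"],
        petal_area=df["petal_length"] * df["petal_width"],
    )


with_areas = add_area_features(sample)
print(with_areas[["species", "sepal_area", "petal_area"]])
```

Because `assign` returns a new DataFrame, the original frame is left untouched; apply the same helper to train, validate, and test so all three keep the same columns.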


In [20]:
print(train.shape, validate.shape, test.shape)

(84, 9) (36, 9) (30, 9)
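
Beyond the shapes, it is worth checking that stratifying kept the species balanced in each sample. This is a sketch under one assumption: scikit-learn's bundled iris (`load_iris`) stands in for the acquired data, so the column names differ from the notebook's. In your notebook, `train.species.value_counts()` answers the same question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris instead of acquire.get_iris_data()
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# Same two-step split as above: 20% test, then 30% of the rest for validate
train_validate, test = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df["species"]
)
train, validate = train_test_split(
    train_validate,
    test_size=0.3,
    random_state=123,
    stratify=train_validate["species"],
)

# Count rows per species in each sample
counts = {
    name: part["species"].value_counts().sort_index().tolist()
    for name, part in [("train", train), ("validate", validate), ("test", test)]
}
print(counts)
```

With 50 rows per species in the full data, each sample should hold equal counts of the three species.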


In [None]:
# train, validate, test = split_data(df)
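
With the splits in hand, exercise 2 starts by melting the numeric columns into long format. Here is a minimal sketch on three rows copied from the `train.head(3)` output; the column names `variable` and `measure` are my choice, and in the notebook you would melt `train` itself.

```python
import pandas as pd

# Three rows copied from the train.head(3) output, for illustration
sample = pd.DataFrame(
    {
        "species": ["versicolor", "setosa", "virginica"],
        "sepal_length": [5.7, 5.5, 6.3],
        "sepal_width": [2.6, 3.5, 2.8],
        "petal_length": [3.5, 1.3, 5.1],
        "petal_width": [1.0, 0.2, 1.5],
    }
)

numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One row per (observation, variable) pair; species is kept as an identifier
melted = sample.melt(
    id_vars="species",
    value_vars=numeric_cols,
    var_name="variable",
    value_name="measure",
)

print(melted.head())
```

The long frame can then go straight to seaborn, for example `sns.swarmplot(data=melted, x="variable", y="measure", hue="species")`.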