# Exploration Exercises

### Who has spent the most money in their lifetime? Within an average of 5 dollars per customer.

1. Make sure to perform a train, validate, test split first, and use only your train dataset to explore the relationships among independent variables and between independent variables and your target variable.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr, spearmanr

import explore
import wrangle_zillow

import warnings
warnings.filterwarnings("ignore")

In [None]:
raw = explore.get_telco()
df = raw.copy()

In [None]:
df = explore.clean_telco(df)

In [None]:
train_validate, test = train_test_split(df,
                                        test_size=.2,
                                        random_state=123,
                                        stratify=df.churn)

train, validate = train_test_split(train_validate,
                                   test_size=.3,
                                   random_state=123,
                                   stratify=train_validate.churn)

train.shape, validate.shape, test.shape

2. Write a function named plot_variable_pairs that accepts a dataframe as input and plots all of the pairwise relationships along with the regression line for each pair.

3. Write a function named months_to_years that accepts your telco churn dataframe and returns a dataframe with a new feature tenure_years, in complete years as a customer.
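
A minimal sketch of this function follows, assuming the tenure column is named `tenure` and is measured in whole months (the column name is an assumption and is exposed as a parameter). Floor division by 12 gives complete years.

```python
import pandas as pd


def months_to_years(df, tenure_col="tenure"):
    """Return a copy of df with tenure_years: complete years as a customer.

    tenure_col is an assumed column name holding tenure in months.
    """
    out = df.copy()  # leave the caller's frame untouched
    out["tenure_years"] = out[tenure_col] // 12
    return out


# Made-up tenures in months, only to show the call.
demo = pd.DataFrame({"tenure": [0, 11, 12, 25, 72]})
result = months_to_years(demo)
print(result)
```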

4. Write a function named plot_categorical_and_continuous_vars that accepts your dataframe and the name of the columns that hold the continuous and categorical features and outputs 3 different plots for visualizing a categorical variable and a continuous variable.
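
The sketch below is one way to approach it, assuming a box plot, a violin plot, and a bar plot are acceptable choices for the three views (the exercise does not say which plots to use). The demo frame is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_categorical_and_continuous_vars(df, continuous, categorical):
    """Draw three views of one continuous column split by one categorical column."""
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    # Box plot: median, quartiles, and outliers per category.
    sns.boxplot(data=df, x=categorical, y=continuous, ax=axes[0])
    axes[0].set_title("Box plot")
    # Violin plot: shape of the distribution per category.
    sns.violinplot(data=df, x=categorical, y=continuous, ax=axes[1])
    axes[1].set_title("Violin plot")
    # Bar plot: mean per category with a confidence interval.
    sns.barplot(data=df, x=categorical, y=continuous, ax=axes[2])
    axes[2].set_title("Bar plot")
    fig.tight_layout()
    return fig, axes


# Made-up data, only to show the call.
demo = pd.DataFrame({
    "contract": ["monthly"] * 4 + ["one_year"] * 4 + ["two_year"] * 4,
    "monthly_charges": [70, 85, 90, 60, 55, 65, 75, 50, 40, 45, 60, 35],
})
fig, axes = plot_categorical_and_continuous_vars(demo, "monthly_charges", "contract")
```

Using axes-level functions with a shared `plt.subplots` figure keeps the three plots side by side and lets `figsize` take effect.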

5. Save the functions you have written to create visualizations in your explore.py file. Rewrite your notebook code so that you are using the functions imported from this file.

6. Explore your dataset with any other visualizations you think will be helpful.

7. In a separate notebook, use the functions you have developed in this exercise with the mall_customers dataset in the Codeup database server. You will need to write a SQL query to acquire your data. Make spending_score your target variable.

# Exercises II - Challenge for Zillow
- Your goal is to predict the values of single unit properties using the observations from 2017.

1. Use the functions you created above to explore your Zillow train dataset in your explore.ipynb notebook.

In [None]:
train, validate, test = wrangle_zillow.wrangle_zillow()

In [None]:
train.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

In [None]:
cols = train.drop(columns=['tax_value','tax_amount','area'])
for col in cols:
    print(train[col].value_counts())
# There are 3 fips values; the majority of properties are in 6037.

In [None]:
# Takes a long time.
sns.pairplot(train)

In [None]:
sns.lmplot(x='year_built', y='area', data=train, scatter=True, hue=None, col=None)

In [None]:
sns.jointplot(x='year_built', y='area', data=train, kind='scatter', hue='tax_value')

In [None]:
cols = train.drop(columns=['fips']).corr()

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(cols, cmap='Purples', annot=True, linewidth=0.5, mask= np.triu(cols))
plt.ylim(0, 6)
plt.show()

In [None]:
# .25 correlation between tax_value and year_built. Shows that value increases with area & year_built
plt.figure(figsize=(10,10))
sns.relplot(x='year_built', y='area', data=train,  hue='tax_value', kind='scatter')

In [None]:
# Shows that larger area tends to come with more bathrooms. Slight correlation with increased value.
plt.figure(figsize=(10,10))
sns.relplot(x='area', y='tax_value', data=train, hue='bathrooms', kind='scatter')

In [None]:
# Shows that bedrooms is not associated with value; it mostly tracks area.
plt.figure(figsize=(10,10))
sns.relplot(x='area', y='tax_value', data=train, hue='bedrooms', kind='scatter')

In [None]:
plt.figure(figsize=(10,10))
sns.relplot(x='year_built', y='tax_value', data=train, hue='bathrooms', kind='scatter')

In [None]:
# Shows that later years have more bedrooms but seem to impact value
plt.figure(figsize=(10,20))
sns.relplot(x='year_built', y='tax_value', data=train, hue='bedrooms', kind='scatter')
plt.show()

In [None]:
# Shows that area increases with year built, and that larger homes tend to have higher value
plt.figure(figsize=(10,20))
sns.relplot(x='year_built', y='tax_value', data=train, hue='area', kind='scatter')
plt.show()

In [None]:
# shows that 6037 has majority of properties, 6059 has more high value properties
plt.figure(figsize=(15,5))
sns.relplot(x='year_built', y='tax_value', data=train, hue='fips', kind='scatter')
plt.show()

In [None]:
# Mean area by county (x='fips' is assumed; the original call left x blank)
plt.figure(figsize=(15,5))
sns.barplot(data=train, y='area', x='fips')
plt.show()

In [None]:
# shows 
plt.figure(figsize=(15,5))
sns.boxplot(data=train, y='tax_value', x='fips')
plt.show()

In [None]:
# shows 
plt.figure(figsize=(15,5))
sns.barplot(data=train, y='tax_value', x='area', hue='fips')
plt.show()

In [None]:
# shows 
plt.figure(figsize=(15,5))
sns.barplot(data=train, y='tax_value', x='fips')
plt.show()

In [None]:
# shows that bathrooms have a slight correlation to value
plt.figure(figsize=(15,5))
sns.boxplot(data=train, y='tax_value', x='bathrooms')
plt.show()

In [None]:
# shows that bedrooms have a slight correlation to value
plt.figure(figsize=(15,5))
sns.boxplot(data=train, y='tax_value', x='bedrooms')
plt.show()

In [None]:
# Takes too long to show.
sns.swarmplot(x='bathrooms', y='area', data=train, hue='tax_value')

In [None]:
sns.stripplot(x='bathrooms', y='area', data=train, hue='tax_value')

2. Come up with some initial hypotheses based on your goal of predicting property value.

H0 = property value does not increase with area
Ha = property value increases with area 

3. Visualize all combinations of variables in some way

See the plots above.

4. Run the appropriate statistical tests where needed.

In [None]:
# shows that there is a correlation between area and value
r, p_value = spearmanr(train.area, train.tax_value)
print(f'Correlation Coefficient: {r}\nP-value: {p_value}')
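
To turn the test output into a decision on the hypotheses, the result can be compared against a significance level. The sketch below assumes `alpha = 0.05` (not stated in the notebook) and uses made-up numbers in place of `train.area` and `train.tax_value`. Because `spearmanr` reports a two-sided p-value by default, the sign of `r` is checked as well before concluding that value increases with area.

```python
from scipy.stats import spearmanr


def spearman_report(x, y, alpha=0.05):
    """Run Spearman's rank correlation and decide whether to reject H0.

    alpha is an assumed significance level. H0 is rejected in favor of an
    increasing relationship only when p < alpha and r is positive.
    """
    r, p_value = spearmanr(x, y)
    reject_null = bool(p_value < alpha and r > 0)
    return r, p_value, reject_null


# Made-up, strictly increasing values standing in for area and tax_value.
area = [850, 1100, 1400, 1750, 2100, 2600, 3200, 4000]
tax_value = [v ** 2 / 10 for v in area]

r, p_value, reject_null = spearman_report(area, tax_value)
print(f'Correlation Coefficient: {r}\nP-value: {p_value}\nReject H0: {reject_null}')
```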

5. What independent variables are correlated with the dependent variable, home value?
- area has highest correlation to tax value, then bathrooms, followed by year_built, then bedrooms

6. Which independent variables are correlated with other independent variables (bedrooms, bathrooms, year built, square feet)?

area and bathrooms
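
A quick way to check this answer across all feature pairs is to scan the correlation matrix for pairs above a cutoff. The helper below is a hypothetical sketch: the name `correlated_pairs`, the `threshold` of 0.5, and the demo values are all assumptions, and Pearson correlation is used to match the default of `.corr()`.

```python
import pandas as pd


def correlated_pairs(df, features, threshold=0.5, method="pearson"):
    """List feature pairs whose absolute correlation meets the threshold."""
    corr = df[features].corr(method=method)
    pairs = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:  # visit each unordered pair once
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda t: abs(t[2]), reverse=True)


# Made-up values, only to show the call (not Zillow data).
demo = pd.DataFrame({
    "area": [1000, 1500, 2000, 2500, 3000],
    "bathrooms": [1, 2, 2, 3, 3],
    "year_built": [1980, 1960, 2000, 1960, 1980],
})
pairs = correlated_pairs(demo, ["area", "bathrooms", "year_built"])
print(pairs)
```

On the real data this would be called with the train split, for example `correlated_pairs(train, ['bedrooms', 'bathrooms', 'year_built', 'area'])`.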

7. Make sure to document your takeaways from visualizations and statistical tests as well as the decisions you make throughout your process.

More Questions:
- Do bathrooms weigh more than bedrooms?
- Does year_built weigh more than area?

In [2]:
# Based on a function by Joshua Wheeler
continuous = ['area','year_built','tax_value']
categorical =['bedrooms','bathrooms']

In [5]:
for con in continuous:
    for cat in categorical:
        plt.figure(figsize=(20,10))
        # barplot is a seaborn function, not a pyplot one; pass column names with data=
        sns.barplot(x=cat, y=con, data=train)
        plt.show()