# EDA MIX data

To explore all 81 columns of this data frame, the NEO team decided to split the EDA into 3 notebooks:
1. EDA of Structural Data
2. EDA of Rooms Data
3. EDA of Mix Data (data that belongs to neither the Structural nor the Rooms group)

In [1]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
import pandas as pd
import numpy as np

# Empirical CDF helper: returns a function v -> fraction of samples <= v
def ecdf(x):
    x = np.sort(x)
    def result(v):
        return np.searchsorted(x, v, side='right') / x.size
    return result

# Plot one ECDF of `target` per level of the categorical `column`
def multiple_ecdf(column, target='SalePrice'):
    fig = go.Figure()
    # Drop NaN first: `== NaN` never matches, so it would add an empty trace
    keys = train[column].dropna().unique()
    for key in keys:
        bool_series = train[column]==key
        values = train.loc[bool_series, target]
        xs = np.unique(values)
        fig.add_trace(
            go.Scattergl(
                x=xs,
                y=ecdf(values)(xs),
                line_shape='hv',
                name=str(key) + ', total: ' + str(bool_series.sum()))
        )
    return fig.show()

train = pd.read_csv('../data/raw/train.csv')
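
To make the behaviour of the `ecdf` helper concrete, here is a minimal sketch on a small array. The `toy_prices` values are hypothetical and are not taken from `train.csv`; the helper is repeated so that the block runs on its own.

```python
import numpy as np

def ecdf(x):
    """Return a function giving the fraction of samples <= v."""
    x = np.sort(x)
    def result(v):
        return np.searchsorted(x, v, side='right') / x.size
    return result

# Hypothetical toy prices, not taken from train.csv
toy_prices = np.array([100, 200, 200, 400])
cdf = ecdf(toy_prices)

ecdf_at_200 = cdf(200)  # 3 of the 4 samples are <= 200
ecdf_at_50 = cdf(50)    # no sample is <= 50

# Evaluating at the distinct values gives the step heights that
# multiple_ecdf draws with line_shape='hv'
ecdf_curve = cdf(np.unique(toy_prices))
```

Using `side='right'` counts ties as "less than or equal", which is the usual ECDF convention.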

**Mix Columns**

The columns of the train dataset that we are calling mix data are listed in the variable **mix_columns**.

In [None]:
mix_columns = ['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea',
               'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
               'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
               'Condition2', 'YearBuilt', 'YearRemodAdd', 'BldgType', 'HouseStyle',
               'Functional', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold',
               'SaleType', 'SaleCondition', 'SalePrice']

train[mix_columns].head()

In [None]:
train[mix_columns].isna().mean()

In [None]:
train[mix_columns].describe()

## NaN Values

This part of the dataset has NaN values in 3 columns. The data description readily explains the missing values in the Alley and MiscFeature columns, so we focus on understanding why the LotFrontage column has NaN values.

In [None]:
train[train['MSZoning']=='RH']['LotFrontage']
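
The cell above inspects a single zoning class. One possible way to compare all classes at once is to compute the share of missing `LotFrontage` values per group. This is only a sketch: the `toy` frame below is hypothetical and its values are illustrative, not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not from train.csv
toy = pd.DataFrame({
    'MSZoning': ['RH', 'RH', 'RL', 'RL', 'RL', 'RM'],
    'LotFrontage': [60.0, np.nan, 80.0, np.nan, np.nan, 50.0],
})

# Share of missing LotFrontage within each zoning class, highest first
missing_share = (
    toy['LotFrontage'].isna()
    .groupby(toy['MSZoning'])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

Swapping `toy` for `train`, and `MSZoning` for another categorical column, would show whether the missing values concentrate in particular groups.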

## Sales variables

In this section we analyse the variables related to the sale. These variables are:

- MoSold
- YrSold
- SaleType
- SaleCondition
- SalePrice

In [None]:
sns.pairplot(train[['SalePrice','MoSold','YrSold']])

In [None]:
train['SaleCondition'].value_counts()

**Sale Condition**

The sale condition is a categorical variable with 6 levels. The levels "Family", "Alloca" and "AdjLand" have few observations, so we merge them into a single level named 'Other'.
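
As a minimal sketch of this kind of merge, rare levels can also be selected by a count threshold instead of being listed by hand. The helper `collapse_rare_levels`, its `min_count` parameter and the `toy` series are hypothetical names introduced only for illustration, and the counts are not those of `train.csv`.

```python
import pandas as pd

def collapse_rare_levels(series, min_count, other_label='Other'):
    """Return a copy where levels seen fewer than `min_count` times
    are replaced with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with other_label
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy column; counts are illustrative, not from train.csv
toy = pd.Series(['Normal'] * 5 + ['Partial'] * 3 + ['Family', 'Alloca', 'AdjLand'])
collapsed = collapse_rare_levels(toy, min_count=2)
level_counts = collapsed.value_counts().to_dict()
```

The function returns a new Series, so the result would be assigned back to the column rather than modified in place.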


In [None]:
multiple_ecdf('SaleCondition')

In [None]:
# Assign back: inplace=True on a selected column may not update `train`
train['SaleCondition'] = train['SaleCondition'].replace(['AdjLand','Alloca','Family'], 'Other')

In [None]:
multiple_ecdf('SaleType')

In [None]:
# Assign back: inplace=True on a selected column may not update `train`
train['SaleType'] = train['SaleType'].replace(['ConLD', 'ConLI', 'ConLw'], 'Con')
train['SaleType'] = train['SaleType'].replace('CWD', 'WD')

**Sale date**

We analyse how the sale price changes over time. This can indirectly reveal economic patterns.

In [None]:
train.groupby(['YrSold'])['SalePrice'].describe()

In [None]:
train[train['YrSold']==2010]['MoSold'].value_counts()

In [None]:
fig = make_subplots(rows=1, cols=2,
                   subplot_titles=("Mean", "Std"))

fig.add_trace(go.Scatter(x=np.sort(train['YrSold'].unique()), 
                         y=train.groupby(['YrSold'])['SalePrice'].mean()),
            row=1,col=1)

fig.add_trace(go.Scatter(x=np.sort(train['YrSold'].unique()), 
                         y=train.groupby(['YrSold'])['SalePrice'].std()),
            row=1,col=2)

fig.update_layout(showlegend=False, title_text="Mean and Std of Sales Price per Year")
fig.show()

In [None]:
# Build a sale date from year and month, using the first day of each month
train['dateSold'] = pd.to_datetime(
    train[['YrSold', 'MoSold']]
    .rename(columns={"YrSold": "year", "MoSold": "month"})
    .assign(day=1))
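
The cell above relies on `pd.to_datetime` assembling dates from columns named `year`, `month` and `day`. A minimal sketch on a hypothetical two-row frame (values are illustrative, not from `train.csv`) shows the result:

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative, not from train.csv
toy = pd.DataFrame({'YrSold': [2008, 2010], 'MoSold': [2, 11]})

# pd.to_datetime assembles dates from columns named year, month and day;
# the day is fixed to 1 because only year and month are known
parts = toy.rename(columns={'YrSold': 'year', 'MoSold': 'month'}).assign(day=1)
toy['dateSold'] = pd.to_datetime(parts[['year', 'month', 'day']])
print(toy)
```

Every sale in a given month maps to the first day of that month, so `dateSold` is best read as a monthly bucket rather than an exact date.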

In [None]:
# Mean sale price per month of sale
monthly_mean = train.groupby('dateSold')['SalePrice'].mean().reset_index()
fig = px.line(monthly_mean, x='dateSold', y='SalePrice')
fig.show()

**Sale date with Condition and Type**

We now analyse the sale price over time, grouped by sale condition and sale type.

In [None]:
# Standard deviation of the sale price per month of sale
monthly_std = train.groupby('dateSold')['SalePrice'].std().reset_index()
fig = px.line(monthly_std, x='dateSold', y='SalePrice')
fig.show()

In [None]:
fig = px.scatter(train, x='dateSold', y='SalePrice',
                 trendline="lowess", color='SaleType', opacity=0.2)
fig.show()

In [None]:
fig = px.scatter(train, x='dateSold', y='SalePrice',
                 trendline="lowess", color='SaleCondition', opacity=0.2)
fig.show()

## Neighborhood

We're going to analyze how much the following aspects of the neighborhood affect the house price.

- MSZoning
- Street
- Alley
- Neighborhood
- Condition1
- Condition2
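
Alongside the ECDF plots, a small numeric summary per level can help judge whether a difference rests on enough observations. The following is a sketch under stated assumptions: the `toy` frame is hypothetical and its values are illustrative, not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not from train.csv
toy = pd.DataFrame({
    'Street': ['Pave', 'Pave', 'Pave', 'Grvl'],
    'SalePrice': [100, 200, 300, 150],
})

# Number of sales and median price for each level of the variable
summary = toy.groupby('Street')['SalePrice'].agg(['count', 'median'])
print(summary)
```

Levels with very few rows, as the `count` column would reveal, give unstable medians and ECDF curves.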

In [18]:
multiple_ecdf('MSZoning')

In [13]:
train['Street'].value_counts()

Pave    1454
Grvl       6
Name: Street, dtype: int64

In [20]:
train['Alley'].value_counts()

Grvl    50
Pave    41
Name: Alley, dtype: int64

In [27]:
multiple_ecdf('Alley')

In [28]:
multiple_ecdf('Neighborhood')

In [29]:
multiple_ecdf('Condition1')

In [31]:
multiple_ecdf('Condition2')