# Data Exploration, Visualization, and Regression

Before modeling, data scientists need to understand the characteristics of their data. This notebook covers how to explore a dataset, visualize it, handle missing values, and fit a simple linear regression.

- The pandas package is used to explore datasets.
- The numpy package is used for high-level mathematical functions.
- The matplotlib package is used for visualization (graphs).
- The seaborn package makes graphs more beautiful.

In [6]:
# IMPORT PACKAGES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Visually Displaying your Data w/ Plots and Graphs
## Line graph, histogram, scatter plot, bar graph, pie chart

**Line Chart w/ Matplotlib (plt)**
Create two lists, x and y. Plot and label a line chart with x and y.

In [8]:
x = [1,2,3,4,5,6,7,8,9,10]
y = [0.3,4/5,5,5.5,6,8,12,17,27,30]
plt.plot(x,y)
plt.title("Hey, This is How You Make a Title")
plt.xlabel("x values")
plt.ylabel("y values")

**Histogram with Matplotlib (plt)**
Create two variables: **N_points** and **n_bins**
- N_points is the number of random samples drawn for each histogram.
- n_bins represents the number of buckets (bins) we will use in the histogram.

In [10]:
# CREATE HISTOGRAM
N_points = 100000
n_bins = 10

# Generate two normally distributed samples: x centered at 0, y centered at 5
x = np.random.randn(N_points)
y = .4 * x + np.random.randn(N_points) + 5

fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)

# We can set the number of bins with the `bins` kwarg
axs[0].hist(x, bins=n_bins)
axs[1].hist(y, bins=n_bins)
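
To see what the `bins` argument does numerically, the sketch below counts the values that fall in each bin with `np.histogram`, without drawing anything. The sample size, the seed, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical small sample; the seed only makes the run repeatable.
rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)

# np.histogram returns the number of values in each bin and the bin edges.
counts, edges = np.histogram(sample, bins=10)

print(counts)       # ten counts, one per bin
print(len(edges))   # ten bins are bounded by eleven edges
```

Every value lands in exactly one bin, so the counts add up to the sample size.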

**Scatter Plot with Matplotlib (plt)**
Create three groups of (x, y) points, g1, g2 and g3, then combine the three groups in **data**.

The for loop pairs each group in **data** with a color and a label.
**ax.scatter**() draws each group on the same scatterplot.

In [12]:
# CREATE SCATTER PLOT
# Create data
N = 60
g1 = (0.6 + 0.6 * np.random.rand(N), np.random.rand(N))
g2 = (0.4+0.3 * np.random.rand(N), 0.5*np.random.rand(N))
g3 = (0.3*np.random.rand(N),0.3*np.random.rand(N))

data = (g1, g2, g3)
colors = ("red", "green", "blue")
groups = ("coffee", "tea", "water")

# Create plot
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, facecolor="1.0")

for group_data, color, group in zip(data, colors, groups):
    x, y = group_data
    # Draw inside the loop so all three groups appear in the plot and the legend
    ax.scatter(x, y, alpha=0.8, c=color, edgecolors='none', s=30, label=group)

plt.title('Matplot scatter plot')
plt.legend(loc=2)
plt.show()

**Bar Plot with Matplotlib (plt)**

In [14]:
# CREATE BAR PLOT
labels = ['G1', 'G2', 'G3', 'G4', 'G5']
men_means = [20, 34, 30, 35, 27]
women_means = [25, 32, 34, 20, 25]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, men_means, width, label='Men')
rects2 = ax.bar(x + width/2, women_means, width, label='Women')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()


def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')


autolabel(rects1)
autolabel(rects2)

fig.tight_layout()

plt.show()

**Pie Chart with Matplotlib (plt)**

In [16]:
# Data to plot
labels = 'Python', 'C++', 'Ruby', 'Java'
sizes = [215, 130, 245, 210]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0, 0)  # explode 1st slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.show()

# HOUSING PRICES DATASET
We will use the housing prices dataset to start practicing techniques for viewing and analyzing datasets with Pandas.

The housing prices dataset is sourced from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

In [17]:
# OPEN DATASET
df = pd.read_csv("housing_price.csv")

In [18]:
# VIEW TOP 10 ROWS OF THE DATASET
df.head(10)

In [19]:
# shape shows (#rows, #columns)
df.shape

**Review all of the Features in the Dataset**

In [21]:
df.columns

**Review the Data Types of the Features**
- int64: Integers
- object: Text or mixed numeric and non-numeric values
- float64: Floating point numbers

In [23]:
df.info()
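
As a small illustration of these data types, the sketch below builds a toy DataFrame (hypothetical column names and values, not taken from the housing data) and inspects its dtypes. Text columns usually appear as `object`, although the exact label can depend on the pandas version.

```python
import pandas as pd

# Hypothetical toy data with one integer, one float and one text column.
toy = pd.DataFrame({
    "rooms": [3, 4, 2],
    "area": [70.5, 92.0, 48.3],
    "zone": ["RL", "RM", "RL"],
})

# One dtype per column.
print(toy.dtypes)

# Keep only the numeric columns, e.g. before computing statistics.
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))
```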

**Review Summary Statistics of the Features**

In [25]:
df.describe()

**Review Percent of Missing Values from the Dataset**

In [27]:
# PERCENT OF MISSING VALUES
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
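
Note that the `Percent` column above holds a fraction between 0 and 1 rather than a value out of 100. The sketch below reproduces the same calculation on a toy DataFrame (hypothetical values) so the numbers are easy to check by hand; `isnull().mean()` is used here as an equivalent shortcut for dividing the missing count by the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" has 2 of 4 values missing, "b" has 1, "c" has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "y", "z"],
    "c": [1, 2, 3, 4],
})

total = toy.isnull().sum()        # number of missing values per column
fraction = toy.isnull().mean()    # share of missing values per column (0 to 1)

summary = pd.concat([total, fraction], axis=1, keys=["Total", "Fraction"])
summary = summary.sort_values("Total", ascending=False)
print(summary)
```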

**Selecting Specific Features**

Look into the Alley feature to review some of the values

In [41]:
df[['Alley']]

In [42]:
df[df.Alley.isnull()]

### Review Correlations of the Features
- Create a correlation matrix using corr()
- Plot a heatmap of the correlation matrix using seaborn (sns)

In [44]:
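# CORRELATION PLOT
# numeric_only=True restricts the matrix to numeric columns; recent pandas
# versions raise an error on text columns otherwise
corrmat = df.corr(numeric_only=True)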
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=1, square=True);

In [46]:
# Keep the k variables with the largest correlation with SalePrice
k = 20 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
f, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(df[cols].corr(), vmax=.8, square=True);
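
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `corr()`. As a rough sketch of what that number means, the example below computes it by hand on toy data (hypothetical values) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: price rises with area, but not perfectly.
toy = pd.DataFrame({
    "area": [50.0, 60.0, 80.0, 100.0, 120.0],
    "price": [100.0, 130.0, 150.0, 210.0, 240.0],
})

# Pearson r: covariance of the centered values divided by the product of their spreads.
x = toy["area"] - toy["area"].mean()
y = toy["price"] - toy["price"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same coefficient as computed by pandas.
r_pandas = toy["area"].corr(toy["price"])

print(r_manual, r_pandas)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.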

### Create a Pairplot to review feature relationships
The feature that we are trying to predict is **SalePrice**

These plots will help us review the relationship of SalePrice with two strongly correlated features

In [48]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea']
sns.pairplot(df[cols], height = 4);

### Review Summary Statistics of the SalePrice feature

In [49]:
df['SalePrice'].describe()

### Review distribution of the SalePrice feature

In [51]:
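# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df['SalePrice'], kde=True);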

Review the skewness of the SalePrice feature

In [53]:
print("Skewness: %f" % df['SalePrice'].skew())
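
To help interpret the skewness value, the sketch below compares a roughly symmetric sample with a right-skewed one. Both samples are synthetic, and their parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical samples: one symmetric (normal), one with a long right tail (lognormal).
symmetric = pd.Series(rng.normal(loc=180000, scale=30000, size=5000))
right_skewed = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

print("Symmetric skewness: %f" % symmetric.skew())
print("Right-skewed skewness: %f" % right_skewed.skew())

# In a right-skewed sample the long tail pulls the mean above the median.
print(right_skewed.mean() > right_skewed.median())
```

A skewness near 0 suggests a symmetric distribution, while a clearly positive value indicates a long right tail.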

In [55]:
var = 'GrLivArea'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

In [57]:
var = 'TotalBsmtSF'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

In [62]:
#Review the number of houses in each OverallQual rating group
df['OverallQual'].value_counts()

5     397
6     374
7     319
8     168
4     116
9      43
3      20
10     18
2       3
1       2
Name: OverallQual, dtype: int64

In [64]:
var = 'OverallQual'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(14, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

In [65]:
var = 'YearBuilt'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);

### In summary
From the plots above, we can conclude that:

- 'GrLivArea' and 'TotalBsmtSF' seem to be linearly related with 'SalePrice'. Both relationships are positive, which means that as one variable increases, the other also increases. In the case of 'TotalBsmtSF', we can see that the slope of the linear relationship is particularly high.
- 'OverallQual' and 'YearBuilt' also seem to be related with 'SalePrice'. The relationship seems to be stronger in the case of 'OverallQual', where the box plot shows how sale prices increase with the overall quality.

# MISSING DATA
How do data scientists deal with missing values? There are a few ways to tackle this problem.

First, **how much data is missing**? As a rule of thumb, be cautious with data where over 30% of the values are missing. However, we do not always have a choice. Discretion should be used for each project.

If your data is quantitative, one way is to fill in the missing values with a measure of central tendency: the **mean or median**. If it is categorical, you can fill in the missing values so that the categories keep the same proportions.

More advanced methods of imputation include clustering and MICE (Multiple Imputation by Chained Equations). We will cover general clustering concepts in the next bootcamp!

Another way is to simply ignore those data points. Take precaution and try to understand any bias that may come from this method.
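
The options above can be sketched on a toy DataFrame. The column names and values are hypothetical, and the proportional fill for the categorical column is only one possible way to keep the category proportions: it draws random replacements using the observed frequencies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "area": [50.0, np.nan, 80.0, 100.0, 1000.0],
    "zone": ["RL", "RL", None, "RM", "RL"],
})

# Quantitative: fill with the mean or the median.
# The outlier (1000) pulls the mean far more than the median.
mean_filled = toy["area"].fillna(toy["area"].mean())
median_filled = toy["area"].fillna(toy["area"].median())

# Categorical: draw replacements according to the observed category frequencies.
rng = np.random.default_rng(0)
freqs = toy["zone"].value_counts(normalize=True)
missing = toy["zone"].isnull()
draws = rng.choice(freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy())
zone_filled = toy["zone"].copy()
zone_filled.loc[missing] = draws

# Alternative: ignore incomplete rows altogether.
dropped = toy.dropna()

print(mean_filled.iloc[1], median_filled.iloc[1], zone_filled.tolist(), len(dropped))
```

In this toy example, dropping rows discards two of the five observations, which shows how quickly that approach can shrink a dataset.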

**Filling in Missing Data within the Housing Prices Dataset**

In [67]:
#Review the missing data
missing_data.head(20)

In [68]:
# IMPUTE WITH MODE
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond','MSSubClass','SaleType','GarageYrBlt','MSZoning','Electrical','KitchenQual','BsmtQual','Exterior1st','Exterior2nd', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', "MasVnrType", "FireplaceQu"):
    df[col] = df.groupby("Neighborhood")[col].transform(lambda x: x.fillna(x.mode()[0]))
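
One caveat with the cell above: `x.mode()[0]` fails when every value in a neighborhood is missing, because the mode of an all-missing group is empty. The sketch below shows a more defensive variant on toy data. The helper name `fill_with_group_mode` and the toy values are hypothetical, and whether such a group exists in the housing data is not checked here.

```python
import pandas as pd

def fill_with_group_mode(series):
    """Fill missing values with the group's most frequent value, if there is one."""
    modes = series.mode()
    if modes.empty:
        # The whole group is missing: leave it unchanged for a later step.
        return series
    return series.fillna(modes.iloc[0])

# Hypothetical toy data: neighborhood "B" has no observed GarageType at all.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "GarageType": ["Attchd", None, "Attchd", None, None],
})

toy["GarageType"] = (
    toy.groupby("Neighborhood")["GarageType"].transform(fill_with_group_mode)
)
print(toy)
```

Groups that stay empty can then be filled with a global mode or with a placeholder such as 'None', depending on what the missing value means for that feature.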

In [69]:
# IMPUTE WITH NONE
for col in ("Alley","MiscFeature","PoolQC",'Fence'):
    df[col] = df[col].fillna('None')

In [70]:
# IMPUTE WITH MEDIAN
for col in ("MasVnrArea", 'LotFrontage','GarageArea', 'GarageCars','BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    df[col] = df.groupby("Neighborhood")[col].transform(lambda x: x.fillna(x.median()))

In [71]:
# DROP DATA
df = df.drop(['Utilities'], axis=1)

In [72]:
# FILL IN WITH CERTAIN VALUE
df["Functional"] = df["Functional"].fillna("Typ")

In [73]:
# Check remaining missing values if any

df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :df_na})
missing_data.head()

(Output: an empty table with the single column `Missing Ratio`; no columns with missing values remain.)


# LINEAR REGRESSION
To learn about linear regression: https://www.kdnuggets.com/2019/03/beginners-guide-linear-regression-python-scikit-learn.html

In [74]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

In [75]:
#Create the X & Y Variables
# X is the predictor (GrLivArea); y is the target we want to predict (SalePrice)
X = df['GrLivArea'].values.reshape(-1,1)
y = df['SalePrice'].values.reshape(-1,1)

In [76]:
#Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [78]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train) #training the algorithm

In [79]:
#To retrieve the intercept:
print(regressor.intercept_)

#For retrieving the slope:
print(regressor.coef_)

In [80]:
y_pred = regressor.predict(X_test)
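
The intercept and slope define every prediction the model makes: prediction = intercept + slope * x. The sketch below checks this on toy data that lies exactly on a line. The numbers are hypothetical, and `y` is kept one-dimensional here, so `intercept_` is a single number rather than an array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on the line price = 1000 + 20 * area.
area = np.array([50.0, 60.0, 80.0, 100.0]).reshape(-1, 1)
price = 1000.0 + 20.0 * area.ravel()

model = LinearRegression().fit(area, price)
print(model.intercept_, model.coef_[0])

# A prediction is the intercept plus the slope times the input value.
manual = model.intercept_ + model.coef_[0] * 70.0
predicted = model.predict(np.array([[70.0]]))[0]
print(manual, predicted)
```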

In [82]:
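# Store the comparison in a new variable so the housing DataFrame `df` is not overwritten
results = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
results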

In [83]:
#Plot the test data (gray points) and the fitted regression line (red)
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

In [84]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
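
For reference, the three metrics above can be computed by hand. The sketch below uses a few hypothetical actual and predicted values so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical actual and predicted values.
actual = np.array([200.0, 150.0, 300.0, 250.0])
predicted = np.array([210.0, 140.0, 270.0, 250.0])

errors = actual - predicted          # [-10, 10, 30, 0]

mae = np.mean(np.abs(errors))        # (10 + 10 + 30 + 0) / 4 = 12.5
mse = np.mean(errors ** 2)           # (100 + 100 + 900 + 0) / 4 = 275.0
rmse = np.sqrt(mse)                  # square root of 275, about 16.58

print(mae, mse, rmse)
```

MAE and RMSE are expressed in the same units as the target, and RMSE penalizes large errors more heavily because the errors are squared before averaging.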