## Lesson 1 - What is Data Science

In [12]:
# Importing packages and data

import pandas as pd
import os

data = pd.read_csv('path/to/data.csv')  # placeholder: replace with the location of your data

### Packages

We'll learn more about these later:
* Pandas: Data structures and analysis  
* NumPy: Base n-dimensional array package  
* SciPy: Fundamental library for scientific computing  
* Matplotlib: Comprehensive 2D/3D plotting  
* IPython: Enhanced interactive console  
* Sympy: Symbolic mathematics  
* Scikit-learn: Supervised/unsupervised learning algorithms

In [1]:
# Calling functions

def division(numerator, denominator):
    result = float(numerator) / denominator
    print(result)

division(20, 10)
division(10, 20)

2.0
0.5


In [5]:
# Split string

my_string = "the cow jumped over the moon"
words = my_string.split()
print(words)

['the', 'cow', 'jumped', 'over', 'the', 'moon']


In [15]:
# how many observations there are: len() gives the number of rows in a DataFrame or the number of items in a list
len(words)

6

In [None]:
# joins a list of strings into one string, using the string before .join as the separator
" ".join(words)

## Lesson 2 - Research Design and Pandas

In [None]:
# returns the first five rows
data.head()

In [None]:
# returns the last five rows
data.tail()

In [None]:
# to print column names
for x in data.columns.values:
    print(x)

### iloc vs loc vs ix

loc - label based  
iloc - position based  
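ix - usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index. It was deprecated in pandas 0.20 and removed in pandas 1.0, so prefer loc or iloc.  

All of them take (row selection, column selection).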
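A small sketch of the difference between label-based and position-based selection. The DataFrame, its values, and its index labels are made up for illustration; the index is chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical example data; index labels deliberately differ from positions
df = pd.DataFrame(
    {"Ozone": [41, 36, 12], "Temp": [67, 72, 74]},
    index=[10, 20, 30],
)

# loc selects by label: the row labelled 20, column "Temp"
by_label = df.loc[20, "Temp"]

# iloc selects by integer position: second row, second column
by_position = df.iloc[1, 1]

# Both take (row selection, column selection), and slices work too.
# loc slices include the end label; iloc slices exclude the end position.
label_slice = df.loc[10:20, "Ozone"]
position_slice = df.iloc[0:2, 0]

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Both selections return the same cell here, but only because label 20 happens to sit at position 1.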

In [None]:
# For NaN values

data.isnull()   # True wherever a value is missing
data.dropna()   # returns a copy with the rows containing NaN removed

In [None]:
# How to get values from a specific column
print(data['Ozone'].mean())

# How to filter rows with greater/less than conditions
print(data[(data.Ozone > 31) & (data.Temp > 90)].head())

# How to filter rows on a specific value
print(data[data.Month == 6].Temp.mean())

## Lesson 3 - Descriptive Statistics for Exploratory Data Analysis

Methods available include:  
.min() - Compute minimum value  
.max() - Compute maximum value  
.mean() - Compute mean value  
.median() - Compute median value  
.mode() - Compute mode value(s)  
.count() - Count the number of observations  
.std() - Compute Standard Deviation  
.var() - Compute variance  
.describe() - Get a summary of the data  
.corr() - Get correlation matrix
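As a rough illustration of these methods, here is a sketch on a single made-up column. The column name `Temp` and its values are assumptions chosen only to make the results easy to check by hand.

```python
import pandas as pd

# Hypothetical column of temperatures, for illustration only
df = pd.DataFrame({"Temp": [60, 70, 70, 80, 90]})

print(df["Temp"].min())      # smallest value
print(df["Temp"].max())      # largest value
print(df["Temp"].mean())     # arithmetic mean
print(df["Temp"].median())   # middle value
print(df["Temp"].mode())     # most frequent value(s), returned as a Series
print(df["Temp"].count())    # number of non-null observations
print(df["Temp"].std())      # sample standard deviation
print(df["Temp"].var())      # sample variance
print(df.describe())         # count, mean, std, min, quartiles, max
```

Note that pandas computes `.std()` and `.var()` with the sample formula (dividing by n - 1) by default.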

In [None]:
# Get quartiles
print(df.quantile(0.50))
print(df.quantile(0.25))
print(df.quantile(0.75))

In [None]:
# Box Plot
df.plot(kind="box")
df['example1'].plot(kind='box')

## Lesson 4 - Inferential Statistics for Model Fit

In [None]:
# print the shape of the DataFrame
data.shape
# it will give you this:
# (number of rows, number of columns)

In [None]:
# Create a scatter plot
# axs is assumed to be an array of matplotlib axes, e.g. from fig, axs = plt.subplots(1, 2)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))

### The different kinds of plots:

* 'bar' or 'barh' for bar plots  
* 'hist' for histogram  
* 'box' for boxplot  
* 'kde' or 'density' for density plots  
* 'area' for area plots  
* 'scatter' for scatter plots  
* 'hexbin' for hexagonal bin plots  
* 'pie' for pie plots

In [None]:
# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf

# create a fitted model in one line
# formula notation is the equivalent of writing out our model as 'outcome = predictor'
# with the following syntax: formula = 'outcome ~ predictor1 + predictor2 ... predictorN'
lm = smf.ols(formula='Sales ~ TV', data=data).fit()

# print the full summary
# Full summary shows r-squared, p-value, skew, kurtosis, etc.
lm.summary()

In [None]:
# By convention, a p-value below 0.05 is treated as statistically significant.
# The higher the r-squared, the more of the variation in the outcome the model explains.

# What does a 95% CI indicate?
# Answer: If we repeated this study many times and built a 95% CI each time,
# about 95% of those intervals would contain the true value being estimated.

## Lesson 5 - Intro to Regression and Model Fit

Linear regression finds the best fit line:

y = mx + b  
y = betas * X + alpha (+ error)

Given a matrix X, its corresponding coefficients beta, and a y-intercept alpha, the model explains a dependent vector, y.

_coefficient - a numerical or constant quantity placed before and multiplying the variable in an algebraic expression, such as 4 in 4x._

====

A linear regression works best when:
* The data is normally distributed (but doesn't have to be)
* The Xs significantly explain y (have low p-values)
* The Xs are independent of each other (low multicollinearity)
* The resulting values pass linear assumptions (dependent on the problem)

In [None]:
# matplotlib and seaborn are used for graphs
import matplotlib.pyplot as plt
import seaborn as sns

# create a matplotlib figure
plt.figure()
# generate a scatterplot inside the figure
# plt.plot(x axis value, y axis value, marker for the graph)
plt.plot(mammals.bodywt, mammals.brainwt, '.')
# show the plot
plt.show()

In [None]:
# lmplot draws a scatter plot with a fitted regression line
# sns.lmplot(x=..., y=..., data=...)
sns.lmplot(x='bodywt', y='brainwt', data=mammals)

Log transformations can be used to make highly skewed distributions less skewed. 

For example, since 64 = 2^6, log2(64) = 6.  
log10(1) = 0  
log10(10) = 1  
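A quick numeric check of these identities, plus a sketch of how a log transform compresses values that span several orders of magnitude. The `values` array is made up for illustration.

```python
import numpy as np

# Hypothetical right-skewed values spanning several orders of magnitude
values = np.array([1, 10, 100, 1000, 10000])

# log2(64) is 6 because 2**6 == 64
print(np.log2(64))

# After log10, each tenfold jump becomes a step of 1
logged = np.log10(values)
print(logged)
```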

In [None]:
# Body weight and brain weight are both highly skewed, so taking the log of each
# brings their distributions closer to normal.

import numpy as np

# Create a new data set that converts the numeric variables to log10
log_columns = ['bodywt', 'brainwt']
log_mammals = mammals.copy()
log_mammals[log_columns] = log_mammals[log_columns].apply(np.log10)

g = sns.lmplot(x='bodywt', y='brainwt', data=log_mammals)
g.set_axis_labels("Log Body Weight", "Log Brain Weight")

# Even though we changed the way the data was shaped, this is still a linear result: 
# it's just linear in the log10 of the data, instead of in the data's natural state.

In [None]:
# Fills all NaN values with whatever number you want (here, 10)
data.fillna(10)

### sklearn

When modeling with sklearn, you'll use the following basic principles.

* All sklearn estimators (modeling classes) are built on a common base estimator. This allows you to easily swap one estimator for another without changing much code.  
* All estimators take a matrix, X, either sparse or dense.  
* Many estimators also take a vector, y, when working on a supervised machine learning problem. Regressions are supervised learning because we already have examples of y given X.  
* All estimators have parameters that can be set. This allows for customization and a higher level of detail in the learning process. The parameters are specific to each estimator algorithm.  
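The principles above can be sketched with `LinearRegression`. The data here is made up (y is exactly 2x + 1), and `fit_intercept=True` is only an example of setting a parameter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y is exactly 2 * x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # matrix: one row per observation
y = np.array([3.0, 5.0, 7.0, 9.0])           # vector of known outcomes

# Parameters are set when the estimator is created
model = LinearRegression(fit_intercept=True)

# Supervised estimators learn from X and y with fit()
model.fit(X, y)

# predict() takes a matrix with the same columns as X
prediction = model.predict(np.array([[5.0]]))

print(model.coef_, model.intercept_)
print(prediction)
```

Because estimators share the same `fit` and `predict` methods, replacing `LinearRegression` with another regressor would leave the rest of this code unchanged.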

### OLS (Ordinary Least Squares)

Error is the difference between prediction and reality: the vertical distance between a real data point and the regression line. OLS is concerned with the squares of the errors. It tries to find the line going through the sample data that minimizes the sum of the squared errors.

This approach assumes that the sample is representative of the population; that is, it assumes that the sample is unbiased.
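To make the "minimize the sum of squared errors" idea concrete, here is a minimal sketch for a single predictor, using the standard closed-form estimates. The sample data and the helper `sum_squared_errors` are made up for illustration.

```python
import numpy as np

# Hypothetical sample data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Closed-form OLS estimates for a single predictor
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()


def sum_squared_errors(m, b):
    """Sum of squared vertical distances between the points and the line y = m*x + b."""
    residuals = y - (m * x + b)
    return np.sum(residuals ** 2)


best = sum_squared_errors(slope, intercept)
print(slope, intercept, best)

# Nudging the line away from the OLS fit should not reduce the squared error
print(best <= sum_squared_errors(slope + 0.1, intercept))
print(best <= sum_squared_errors(slope, intercept - 0.1))
```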

### Betas

Betas are the constants in the regression, the intercept and the slope. 

The intercept is the value of Y when X is 0.  
The slope is also known as the regression coefficient.   
Residual = the error

In [None]:
# Fit regression model
# Template: fill in the formula (e.g. 'outcome ~ predictor') and pass your DataFrame as data
Something = smf.ols(formula=' ', data=XX).fit()

In [None]:
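# Predict value
# This assumes lm was fit on log10-transformed data: transform the new inputs with log10,
# then raise 10 to the prediction to get back to the original units.
10**lm.predict(X_new.apply(np.log10))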

In [None]:
# Data.ColumnName.function
# .value_counts() returns the number of times each value appears
bike_data.weathersit.value_counts()

### Correlations
To see the correlation matrix: .corr()

But to get a heatmap use:  
cmap = sns.diverging_palette(220, 10, as_cmap=True)  
correlations = bike_data[['temp', 'atemp', 'casual']].corr()  
print(correlations)  
sns.heatmap(correlations, cmap=cmap)  

##########################################################

## Review over Citibike Data from Lesson 5

##########################################################

## Lesson 6 - Evaluating Model Fit

### Mean Squared Error (MSE)

Mean squared error is the mean, or average, of the squared residual errors in our model.

To find MSE:

1. Calculate the difference between each target y and the model's predicted value y-hat (this is how we determine the residual).  
2. Square each residual.  
3. Take the mean of the squared residuals.  
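The three steps can be sketched by hand with NumPy. The numbers are the same ones used in the `metrics.mean_squared_error` example in this lesson, so the result should match it.

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5])       # actual targets
y_hat = np.array([5, 4, 3, 2, 1])   # model predictions

residuals = y - y_hat        # step 1: difference between target and prediction
squared = residuals ** 2     # step 2: square each residual
mse = squared.mean()         # step 3: mean of the squared residuals

print(mse)
```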

#####

To calculate using the function:  

from sklearn import metrics  
metrics.mean_squared_error(y, model.predict(X))  

In [5]:
# Example
from sklearn import metrics
metrics.mean_squared_error([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
# (4^2 + 2^2 + 0^2 + 2^2 + 4^2) / 5

8.0

### Bias

When our error is described as biased, it means that the learner's prediction is consistently far away from the actual answer. This is a sign of poor sampling: perhaps the population is not well represented in the model, or other data needs to be collected. We'd prefer if the error was distributed more evenly across the model, even if that means it doesn't explain the sample as well.

### Cross Validation

One approach data scientists use to account for bias is cross validation. The basic idea of cross validation is to generate several models based on different cross sections of the data, measure performance of each, and then take the mean performance. This technique is one way to swap bias error for generalized error in our model.

### k-fold

Split the data into k groups, train the model on all groups except one, and then test its performance on the remaining group. Repeat until each group has been the test set once, then average the k scores.
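A minimal k-fold sketch with k = 5, using `KFold` from `sklearn.model_selection`. The data is randomly generated for illustration, and the choice of `LinearRegression` and MSE as the score are assumptions, not requirements of the technique.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical data: a noisy linear relationship
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=50)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, test_index in kf.split(X):
    # Train on four of the five groups
    model = LinearRegression().fit(X[train_index], y[train_index])
    # Test on the group that was held out
    predictions = model.predict(X[test_index])
    scores.append(mean_squared_error(y[test_index], predictions))

print(scores)            # one MSE per fold
print(np.mean(scores))   # mean performance across folds
```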

### Confusion Matrix

Precision asks how many of the items the model labeled as a class actually belong to that class: true positives / (true positives + false positives). Recall asks how many of the items that actually belong to a class the model found: true positives / (true positives + false negatives) (literally, checking whether the model can recall what a class label looked like).

Imagine predicting a marble color either green or red. There are 10 of each. If the model identifies 8 of the green marbles as green, the recall, or sensitivity, is .8. However, this says nothing about the number of red marbles that are also identified as green.

If no red marbles were predicted as green, the precision for green would be 1, because all marbles predicted as green were in fact green. The precision for red (assuming all 10 red marbles were predicted correctly and the 2 missed green marbles were predicted as red) would be roughly 0.833: 10 / (10 + 2).
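The marble example can be checked with `sklearn.metrics`. This sketch assumes the same scenario as above: 8 green marbles predicted correctly, 2 green marbles predicted as red, and all 10 red marbles predicted correctly.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 10 green and 10 red marbles; 2 green marbles are mistaken for red
y_true = ["green"] * 10 + ["red"] * 10
y_pred = ["green"] * 8 + ["red"] * 2 + ["red"] * 10

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred, labels=["green", "red"])
print(matrix)

green_recall = recall_score(y_true, y_pred, pos_label="green")
green_precision = precision_score(y_true, y_pred, pos_label="green")
red_precision = precision_score(y_true, y_pred, pos_label="red")
print(green_recall, green_precision, red_precision)
```

The first row of the matrix holds the green marbles (8 predicted green, 2 predicted red) and the second row holds the red marbles (0 predicted green, 10 predicted red).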