### Exercise

+ Compute an approximation of $\pi$ using Monte Carlo.

How? Consider the circle of radius $r = 1/2$ inscribed in the unit square. If we can estimate its area $\pi r^2$, then dividing by $r^2 = (1/2)^2 = 1/4$
gives an estimate of $\pi$. We may estimate the area by sampling bivariate uniforms on the unit square and looking at the fraction that fall inside the circle.

In [None]:
## My solution.

from math import sqrt
import random 
n = 1000000

count = 0
for i in range(n):
    u, v = random.random(), random.random()
    d = sqrt((u - 0.5)**2 + (v - 0.5)**2)
    if d < 0.5:
        count += 1.0

area_estimate = count / n

print(area_estimate * 4)  # dividing by radius**2

### Exercise 

+ Write a function with two parameters <code>a</code> and <code>b</code> that computes the final amount we get if we deposit 1000€ for <code>a</code> years in a bank account with an interest rate of <code>b</code> per cent.
+ What is the result for ``a``=10, ``b``=10?

In [None]:
# Your solution here
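
One possible sketch, assuming the interest is compounded once per year and added to the balance (the exercise does not state the compounding rule). The name `final_amount` and the `principal` parameter are illustrative choices, not part of the exercise.

```python
def final_amount(a, b, principal=1000.0):
    """Balance after `a` years at `b` per cent yearly interest.

    Assumes annual compounding: each year the balance grows by b/100.
    """
    return principal * (1 + b / 100.0) ** a


print(final_amount(10, 10))
```

With simple (non-compounded) interest the formula would instead be `principal * (1 + a * b / 100.0)`.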

### Exercise 

+ Write a function with one parameter <code>a</code> that computes the minimum number of years needed to double the amount in an account with an interest rate of <code>a</code> per cent.
+ What is the result for ``a``=3?

In [None]:
# Your solution here
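
A minimal sketch, again assuming annual compounding and counting whole years. The function name `years_to_double` is hypothetical.

```python
def years_to_double(a):
    """Smallest whole number of years after which the balance has at least doubled."""
    if a <= 0:
        raise ValueError("the interest rate must be positive")
    years = 0
    amount = 1.0
    # Apply one year of interest at a time until the balance reaches twice its start
    while amount < 2.0:
        amount *= 1 + a / 100.0
        years += 1
    return years


print(years_to_double(3))
```

The same value can be obtained in closed form by rounding $\log 2 / \log(1 + a/100)$ up to the next integer.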

### Exercise 

> (...) In mathematics, the sieve of Eratosthenes, one of a number of prime number sieves, is a simple, ancient algorithm for finding all prime numbers up to any given limit. A prime number is a natural number which has exactly two distinct natural number divisors: 1 and itself. To find all the prime numbers less than or equal to a given integer $n$ by Eratosthenes' method.
> (Source: *Wikipedia*)

+ Write a program to implement the Eratosthenes algorithm. First create a list of consecutive integers from 2 through $n$: (2, 3, 4, ..., n). Then,
    - Initially, let $p$ equal 2, the first prime number.
    - Starting from $p$, enumerate its multiples by counting to $n$ in increments of $p$, and mark them in the list (these will be 2p, 3p, 4p, etc.; the $p$ itself should not be marked).
    - Find the first number greater than $p$ in the list that is not marked. If there was no such number, stop. Otherwise, let $p$ now equal this new number (which is the next prime), and repeat from step 2.
    - When the algorithm terminates, all the numbers in the list that are not marked are prime. (...)

In [None]:
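def sieve(n):
    "Return all primes <= n."
    np1 = n + 1
    s = list(range(np1))  # s[i] == i; an entry set to 0 is marked as non-prime
    s[1] = 0              # 1 is not prime
    sqrtn = int(round(n**0.5))
    for i in range(2, sqrtn + 1):
        if s[i]:
            # mark i*i, i*i + i, ... as composite with one slice assignment
            s[i*i: np1: i] = [0] * len(range(i*i, np1, i))
    return [x for x in s if x]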

%timeit sieve(1000000)
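
For comparison, here is a more literal transcription of the steps listed in the exercise, using an explicit list of marks. It is a sketch for readability rather than speed, and the name `sieve_marked` is illustrative.

```python
def sieve_marked(n):
    """Return all primes <= n by marking multiples, step by step."""
    marked = [False] * (n + 1)      # marked[k] is True once k is known to be composite
    primes = []
    for p in range(2, n + 1):
        if marked[p]:
            continue                # p was marked as a multiple of a smaller prime
        primes.append(p)            # the first unmarked number is the next prime
        for multiple in range(2 * p, n + 1, p):
            marked[multiple] = True # mark 2p, 3p, 4p, ... (p itself stays unmarked)
    return primes


print(sieve_marked(30))
```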

### Exercise  

+ Compute the set of bigrams of a string. (``'hola'->'ho'+'ol'+'la'``) 

In [None]:
a = "mariposa salvaje"
bigrams = [a[i:i+2] for i in range(len(a)-1)]
print(bigrams)

### Exercise 

+ Compute the set of substrings of a string.  

In [None]:
a="hola"
cont=0
for j in range(len(a)):
    for i in range(j+1,len(a)+1):
        cont=cont+1
        print(cont, a[j:i])
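
Since the exercise asks for a set, a variant that collects the substrings in a Python `set` may be closer to the statement; repeated substrings are then stored only once. The function name `substrings` is an illustrative choice.

```python
def substrings(s):
    """Return the set of all non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


print(sorted(substrings("hola")))
```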

### Exercise

+ Take two lists, say for example these two:

	``a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]  b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]``
    
  and write a program that returns a list that contains only the elements that are common between the lists (without duplicates).

In [None]:
a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
result = [i for i in set(a) if i in b]
print(result)
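
An alternative sketch uses set intersection, which avoids scanning the list `b` once per element. Sorting the result is an extra choice made here so that the output order is predictable; the name `common_elements` is hypothetical.

```python
def common_elements(a, b):
    """Return the values present in both lists, without duplicates, in ascending order."""
    return sorted(set(a) & set(b))


a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
print(common_elements(a, b))
```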

### Exercise

Build a simple class called ``Polynomial`` for representing and manipulating polynomial functions such as

$$ p(x) = a_0 + a_1 x + \dots + a_n x^n $$

The instance data for the class ``Polynomial`` will be the coefficients ($a_0, \dots, a_n$). Provide methods that:

+ Evaluate the polynomial, returning $p(x)$ for any $x$
+ Differentiate the polynomial, replacing the original coefficients with those of its derivative $p′$.

In [None]:
class Polynomial(object):

    def __init__(self, coefficients):
        """
        Creates an instance of the Polynomial class representing 
            p(x) = a_0 x^0 + ... + a_N x^N,           
        where a_i = coefficients[i].
        """
        self.coefficients = coefficients

    def eval(self, x):
        "Evaluate the polynomial at x."
        y = 0
        for i, a in enumerate(self.coefficients):
            y += a * x**i  
        return y

    def differentiate(self):
        "Reset self.coefficients to those of p' instead of p."
        new_coefficients = []
        for i, a in enumerate(self.coefficients):
            new_coefficients.append(i * a)
        # Remove the first element, which is zero
        del new_coefficients[0]  
        # And reset coefficients data to new values
        self.coefficients = new_coefficients
        
a = Polynomial([2,4])
print(a.eval(1))
a.differentiate()
print(a.eval(1))

### Exercise

Consider the polynomial

$$ p(x) = a_0 + a_1 x + \dots + a_n x^n $$

Write a function ``p`` such that ``p(x, coeff)`` computes the value in the polynomial given a point ``x`` and a list of coefficients ``coeff``. Try to use ``enumerate()`` in your loop.

In [None]:
def p(x, coeff):
    return sum(a * x**i for i, a in enumerate(coeff))

p(1, (2, 4))

### Exercise

1. Determine the maximum of a list of numerical values by using ``reduce``.
2. Calculate the sum of the numbers from 1 to 100 by using ``reduce``.

In [None]:
from functools import reduce  # reduce is no longer a builtin in Python 3

f = lambda a,b: a if (a > b) else b
print(reduce(f, [47,11,42,102,13]))

print(reduce(lambda x, y: x+y, range(1,101)))

### Exercise

By using a grid representation:
+ Build a graphical representation of all multiples of 3 from 0 to 49 by using exclusively the ``slicing`` operator (no iterations). ``BlockGrid(50, 1, block_size=10, fill=(123, 234, 123))``
+ Build a graphical representation of the prime numbers from 0 to 4999. (Hint: Compute the list of prime numbers and map this list to the grid representation). ``BlockGrid(50, 100, block_size=10, fill=(123, 234, 123))``
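
The data behind both grids can be sketched with plain Python lists before any drawing is done. The sketch below stands in for the grid with lists of booleans and deliberately leaves out the `BlockGrid` calls; the names `is_multiple_of_3`, `prime_flags` and `rows`, and the row-by-row layout of 50 cells per row, are assumptions made for illustration.

```python
# Multiples of 3 among 0..49, selected with a single extended slice (no loop)
n = 50
is_multiple_of_3 = [False] * n
is_multiple_of_3[::3] = [True] * len(range(0, n, 3))


def prime_flags(limit):
    """flags[k] is True exactly when k is prime, for 0 <= k < limit (limit >= 2)."""
    flags = [True] * limit
    flags[0:2] = [False, False]          # 0 and 1 are not prime
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            # clear i*i, i*i + i, ... in one slice assignment
            flags[i * i::i] = [False] * len(range(i * i, limit, i))
    return flags


flags = prime_flags(5000)
# Assumed layout: 100 rows of 50 cells, number k at row k // 50, column k % 50
rows = [flags[r * 50:(r + 1) * 50] for r in range(100)]
print(sum(is_multiple_of_3), sum(flags))
```

Each `True` entry would then be painted with a different colour in the corresponding grid cell.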

### Exercise

Rewrite the following functions so that they are fully vectorized: that is, so that each consists of a sequence of NumPy operations on whole arrays, with no native Python loops.

In [None]:
import numpy as np
%timeit np.sum(np.arange(3000)) * np.sum(np.arange(3000))

In [None]:
%timeit np.sum(np.searchsorted(np.sort(np.arange(0, 200, 2)), np.arange(40, 140)))

### Exercise 

In the following table we have expression values for 4 genes at 4 time points.

                        Gene name   4h     12h    24h    48h
                        A2M         0.12   0.08   0.06   0.02
                        FOS         0.01   0.07   0.11   0.09
                        BRCA2       0.03   0.04   0.04   0.02
                        CPOX        0.05   0.09   0.11   0.14

+ Create a single array for the data (4x4)
+ Find the mean expression value per gene
+ Find the mean expression value per time point
+ Which gene has the maximum mean expression value? (Use the ``tab`` help on an array)
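
A possible sketch of the steps above, assuming the table is typed in by hand with one row per gene and one column per time point. The variable names are illustrative.

```python
import numpy as np

genes = ["A2M", "FOS", "BRCA2", "CPOX"]
times = ["4h", "12h", "24h", "48h"]

# One row per gene, one column per time point
expr = np.array([
    [0.12, 0.08, 0.06, 0.02],
    [0.01, 0.07, 0.11, 0.09],
    [0.03, 0.04, 0.04, 0.02],
    [0.05, 0.09, 0.11, 0.14],
])

mean_per_gene = expr.mean(axis=1)   # average across the columns (time points)
mean_per_time = expr.mean(axis=0)   # average across the rows (genes)
top_gene = genes[int(mean_per_gene.argmax())]

print(dict(zip(genes, mean_per_gene)))
print(dict(zip(times, mean_per_time)))
print(top_gene)
```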

### Exercise

Consider the polynomial

$$ p(x) = a_0 + a_1 x + \dots + a_n x^n $$

Earlier, you wrote a simple function ``p(x, coeff)`` to evaluate it without considering efficiency. 

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

Hint: Use ``np.cumprod()``

In [None]:
def p(x, coef):
    X = np.empty(len(coef))
    X[0] = 1
    X[1:] = x
    y = np.cumprod(X)   # y = [1, x, x**2,...]
    res = np.sum(coef * y)
    return res

coef = np.random.rand(3000000)  # 1-D array, so len(coef) is the number of coefficients
print(p(1, coef))

%timeit p(1, coef)

def p(x, coeff):
    return sum(a * x**i for i, a in enumerate(coeff))

%timeit p(1, coef)
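
Evaluating at ``x = 1`` cannot distinguish a correct implementation from one that merely sums the coefficients, so a check at another point is worthwhile. The sketch below is one way to do that; `p_vec` is a hypothetical name, and `np.polyval` (which expects the highest-degree coefficient first) serves as the reference.

```python
import numpy as np


def p_vec(x, coef):
    """Evaluate a_0 + a_1 x + ... + a_n x^n with array operations only."""
    coef = np.asarray(coef, dtype=float).ravel()
    powers = np.ones(coef.size)
    powers[1:] = x
    powers = np.cumprod(powers)          # [1, x, x**2, ...]
    return float(np.dot(coef, powers))


coef_small = [1.0, 2.0, 3.0]             # 1 + 2x + 3x^2
print(p_vec(2.0, coef_small))
print(np.polyval(coef_small[::-1], 2.0)) # reference value, coefficients reversed
```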

### Exercise

+ Read the titanic dataset from ``files/titanic.xls`` and inspect the first records.
+ Are there columns that have NaN values?
+ Drop those rows with NaN values in ``age``.
+ What was the probability of survival? Get a variable with this value.
+ What was the probability of survival for each ``pclass``? Get a variable with this value.
+ What was the mean age for third class survivors? Get a variable with this value.

In [None]:
import pandas as pd
import numpy as np
titanic_data = pd.read_excel("files/titanic.xls")
titanic_data.head( 3 )

In [None]:
titanic_data.info()

In [None]:
df_cleaned = titanic_data.dropna(subset = ['age'])
df_cleaned.info()

In [None]:
p_survived = df_cleaned['survived'].mean() 
print(p_survived)

In [None]:
df_cleaned.groupby('pclass')['survived'].mean()

In [None]:
df_cleaned[(df_cleaned['pclass']==3) & (df_cleaned['survived']==1)]['age'].mean()

### Exercise

The MovieLens 1M database (http://www.grouplens.org/node/73) stores 1,000,209 ratings of 3,900 films, collected in 2000 from 6,040 anonymous users of the online MovieLens recommender (http://www.movielens.org/).

In [None]:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('files/ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('files/ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('files/ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')

In [None]:
data = pd.merge(pd.merge(ratings, users), movies)
mean_ratings = data.groupby(by='user_id')['rating'].mean()
mean_ratings.head()

In [None]:
movies_by_user = data.pivot(index='movie_id', columns='user_id', values='rating')

def top_movie(dataFrame, usr):
    # idxmax returns the index label (the movie_id) of the user's highest rating
    return dataFrame[usr].idxmax()

print(top_movie(movies_by_user, 1))

In [None]:
import numpy as np

def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.loc[sampled_ids, 'for_testing'] = True  # .ix has been removed from pandas
    return df
data = pd.merge(pd.merge(ratings, users), movies)
data['for_testing'] = False
grouped = data.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = data[grouped.for_testing == False]
movielens_test = data[grouped.for_testing == True]
print(movielens_train.shape)
print(movielens_test.shape)

def compute_rmse(y_pred, y_true):
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

def evaluate(estimate,test=movielens_test):
    ids_to_estimate = zip(test['user_id'], test['movie_id'])
    estimated = np.array([estimate(u,i) for (u,i) in ids_to_estimate])
    real = test.rating.values
    return compute_rmse(estimated, real)



In [None]:
pivoted_movielens_train = movielens_train.pivot(index='movie_id', columns='user_id', values='rating')
mean_movielens_train = pivoted_movielens_train.mean()

def rec2(user_id, item_id,train=mean_movielens_train):
    return train[user_id]

print('Error: %s' % evaluate(rec2))
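
To make the error measure concrete, here is a small worked sketch of the root mean squared error on made-up numbers, using only the standard library. The ratings below are invented for illustration and do not come from the MovieLens files.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# A hypothetical user whose training ratings average 3.5:
# the mean-rating baseline predicts 3.5 for every held-out film.
held_out = [4, 3, 5]
baseline = [3.5] * len(held_out)
print(rmse(baseline, held_out))
```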

## Exercise

Time Series

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
import seaborn as sns
import statsmodels.api as sm  
from statsmodels.tsa.stattools import acf  
from statsmodels.tsa.stattools import pacf
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv('files/portland-oregon-average-monthly-.csv', index_col=0)
df.index.name=None
df.reset_index(inplace=True)
df.drop(df.index[114], inplace=True)
start = datetime.datetime.strptime("1973-01-01", "%Y-%m-%d")
date_list = [start + relativedelta(months=x) for x in range(0,114)]
df['index'] =date_list
df.set_index(['index'], inplace=True)
df.index.name=None
df.columns= ['riders']
df['riders'] = df.riders.apply(lambda x: int(x)*100)

In [None]:
df.riders.plot(figsize=(12,8), title= 'Monthly Ridership', fontsize=14)
plt.savefig('month_ridership.png', bbox_inches='tight')

In [None]:
decomposition = seasonal_decompose(df.riders, period=12)  # `freq` was renamed `period` in newer statsmodels
fig = plt.figure()  
fig = decomposition.plot()  
fig.set_size_inches(15, 8)

In [None]:
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    
    # Determine rolling statistics
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    #Plot rolling statistics:
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
    
    #Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
    
test_stationarity(df.riders)

In [None]:
df['riders_log'] = np.log(df.riders)  # item assignment creates a real column; attribute assignment does not
test_stationarity(df.riders_log)

In [None]:
df['first_difference'] = df.riders - df.riders.shift(1)  
test_stationarity(df.first_difference.dropna(inplace=False))

In [None]:
df['log_first_difference'] = df.riders_log - df.riders_log.shift(1)  
test_stationarity(df.log_first_difference.dropna(inplace=False))

In [None]:
df['seasonal_difference'] = df.riders - df.riders.shift(12)  
test_stationarity(df.seasonal_difference.dropna(inplace=False))

In [None]:
df['log_seasonal_difference'] = df.riders_log - df.riders_log.shift(12)  
test_stationarity(df.log_seasonal_difference.dropna(inplace=False))

In [None]:
df['seasonal_first_difference'] = df.first_difference - df.first_difference.shift(12)  
test_stationarity(df.seasonal_first_difference.dropna(inplace=False))

In [None]:
df['log_seasonal_first_difference'] = df.log_first_difference - df.log_first_difference.shift(12)  
test_stationarity(df.log_seasonal_first_difference.dropna(inplace=False))
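
The differencing steps in the cells above all follow one pattern: subtract from each value the value `lag` positions earlier. A minimal list-based sketch of that pattern, with `None` playing the role of the missing values that pandas produces at the start of a shifted series (the name `difference` and the sample numbers are illustrative):

```python
def difference(values, lag=1):
    """Return values[i] - values[i - lag], with None where no earlier value exists."""
    return [None] * lag + [values[i] - values[i - lag] for i in range(lag, len(values))]


# First difference removes a trend
print(difference([1, 3, 6, 10]))

# Differencing at the seasonal lag (3 here) removes a repeating pattern
print(difference([10, 20, 30, 12, 22, 32], lag=3))
```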

## Exercise

Time series

In [None]:
%matplotlib inline
import os  
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
import statsmodels.api as sm  
import seaborn as sb  
sb.set_style('darkgrid')

path = 'files/stock_data.csv'  
stock_data = pd.read_csv(path)  
stock_data['Date'] = pd.to_datetime(stock_data['Date'], errors='coerce')  # convert_objects has been removed from pandas
stock_data = stock_data.sort_values(by='Date')  
stock_data = stock_data.set_index('Date')  

stock_data['Close'].plot(figsize=(12, 4))  

In [None]:
stock_data['First Difference'] = stock_data['Close'] - stock_data['Close'].shift()  

stock_data['First Difference'].plot(figsize=(12, 4)) 

In [None]:
stock_data['Natural Log'] = stock_data['Close'].apply(lambda x: np.log(x))  
stock_data['Natural Log'].plot(figsize=(12, 4))  

In [None]:
stock_data['Original Variance'] = stock_data['Close'].rolling(window=30,center=True).var()  
stock_data['Log Variance'] = stock_data['Natural Log'].rolling(window=30,center=True).var()

fig, ax = plt.subplots(2, 1, figsize=(12, 4))  
stock_data['Original Variance'].plot(ax=ax[0], title='Original Variance')  
stock_data['Log Variance'].plot(ax=ax[1], title='Log Variance')  
fig.tight_layout()  

In [None]:
stock_data['Logged First Difference'] = stock_data['Natural Log'] - stock_data['Natural Log'].shift()  
stock_data['Logged First Difference'].plot(figsize=(12, 4)) 

In [None]:
stock_data['Lag 1'] = stock_data['Logged First Difference'].shift()  
stock_data['Lag 2'] = stock_data['Logged First Difference'].shift(2)  
stock_data['Lag 5'] = stock_data['Logged First Difference'].shift(5)  
stock_data['Lag 30'] = stock_data['Logged First Difference'].shift(30)  

In [None]:
sb.jointplot(x='Logged First Difference', y='Lag 1', data=stock_data, kind='reg', height=8)

In [None]:
from statsmodels.tsa.stattools import acf  
from statsmodels.tsa.stattools import pacf

lag_correlations = acf(stock_data['Logged First Difference'].iloc[1:])  
lag_partial_correlations = pacf(stock_data['Logged First Difference'].iloc[1:]) 

In [None]:
fig, ax = plt.subplots(figsize=(12,4))  
ax.plot(lag_correlations, marker='o', linestyle='--')  

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(stock_data['Natural Log'], model='additive', period=30)  # `freq` was renamed `period` in newer statsmodels
fig = plt.figure()  
fig = decomposition.plot()  

In [None]:
model = sm.tsa.ARIMA(stock_data['Natural Log'].iloc[1:], order=(1, 0, 0))  
results = model.fit(disp=-1)  
stock_data['Forecast'] = results.fittedvalues  
stock_data[['Natural Log', 'Forecast']].plot(figsize=(12,4))  

In [None]:
model = sm.tsa.ARIMA(stock_data['Logged First Difference'].iloc[1:], order=(1, 0, 0))  
results = model.fit(disp=-1)  
stock_data['Forecast'] = results.fittedvalues  
stock_data[['Logged First Difference', 'Forecast']].plot(figsize=(12,4)) 

In [None]:
stock_data[['Logged First Difference', 'Forecast']].iloc[1200:1600, :].plot(figsize=(12, 4))  


In [None]:
model = sm.tsa.ARIMA(stock_data['Logged First Difference'].iloc[1:], order=(0, 0, 1))  
results = model.fit(disp=-1)  
stock_data['Forecast'] = results.fittedvalues  
stock_data[['Logged First Difference', 'Forecast']].plot(figsize=(12,4)) 

## Exercise

Anomaly Detection.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context="notebook", style="white", palette=sns.color_palette("RdBu"))

import numpy as np
import pandas as pd
from scipy import stats

from sklearn.model_selection import train_test_split

In [None]:
mat = pd.read_csv('files/ex8data1.csv')
X = np.array(mat)

In [None]:
fig, ax = plt.subplots(figsize=(6,4))  
plt.scatter(X[:,0], X[:,1], c='b', marker='x')
plt.title("Outlier detection")
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)');

In [None]:
mu = X.mean(axis=0)
print(mu, '\n')

cov = np.cov(X.T)
print(cov)

In [None]:
# create multi-var Gaussian model
multi_normal = stats.multivariate_normal(mu, cov)

# create a grid
x, y = np.mgrid[0:30:0.01, 0:30:0.01]
pos = np.dstack((x, y))

fig, ax = plt.subplots()

# plot probability density
ax.contourf(x, y, multi_normal.pdf(pos), cmap='Blues')

# plot original data points
sns.regplot(x='Latency', y='Throughput',
           data=pd.DataFrame(X, columns=['Latency', 'Throughput']), 
           fit_reg=False,
           ax=ax,
           scatter_kws={"s":10,
                        "alpha":0.4})

In [None]:
def estimate_gaussian(X):  
    mu = X.mean(axis=0)
    # stats.norm expects the standard deviation as its scale, not the variance
    sigma = X.std(axis=0)

    return mu, sigma

mu, sigma = estimate_gaussian(X)  
mu, sigma  

In [None]:
p = np.zeros((X.shape[0], X.shape[1]))  
p[:,0] = stats.norm(mu[0], sigma[0]).pdf(X[:,0])  
p[:,1] = stats.norm(mu[1], sigma[1]).pdf(X[:,1])

outliers = np.where(p < 0.009)

fig, ax = plt.subplots(figsize=(6,4))  
ax.scatter(X[:,0], X[:,1])  
ax.scatter(X[outliers[0],0], X[outliers[0],1], s=50, color='r', marker='o') 
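
The thresholding idea can be sketched in one dimension with the standard library alone: fit a normal distribution to the data and flag the points whose density falls below a cut-off. The sample values, the threshold of 0.05 and the name `flag_outliers` are invented for illustration. Like `stats.norm`, `statistics.NormalDist` takes the standard deviation, not the variance, as its second argument.

```python
from statistics import NormalDist, mean, pstdev


def flag_outliers(values, threshold):
    """Indices of the values whose fitted normal density is below `threshold`."""
    dist = NormalDist(mean(values), pstdev(values))
    return [i for i, v in enumerate(values) if dist.pdf(v) < threshold]


# Hypothetical latency readings with one clearly unusual value at the end
latency = [14.1, 13.9, 14.3, 14.0, 13.8, 14.2, 20.0]
print(flag_outliers(latency, threshold=0.05))
```

In practice the threshold would be tuned on labelled examples rather than fixed by hand.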

## Exercise

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline


sig2 = np.linspace(0, 5, num=30)
sig2 = np.concatenate([sig2 for x in range(12)])

# Add a jump
jump_size = 5
sig2[250:] = sig2[250:] + jump_size

# Noise
noise = np.random.normal(
    size=sig2.shape,
    scale=jump_size * 0.1)

ser = pd.Series(sig2) + noise

plt.figure(figsize=(15, 5))
plt.plot(ser, 'b.', alpha=0.5)
plt.ylim(0,15)
plt.xlim(0,365)
plt.title("sig2 : A non-trivial signal")

In [None]:
ser_diff = ser - ser.shift(30)  
plt.figure(figsize=(15, 5))
plt.plot(ser_diff, 'b.', alpha=0.5)
plt.ylim(0,15)
plt.xlim(0,365)
plt.title("sig2 : lag-30 difference")

In [None]:
mean1 = ser_diff.rolling(window=3,center=False).mean()
mean2 = ser_diff.rolling(window=20,center=False).mean()
mean_dif = abs(mean1 - mean2)
change = mean_dif.idxmax()  # index label where the two rolling means differ most
plt.figure(figsize=(15, 5))
plt.plot(ser_diff, 'b.', alpha=0.5)
plt.axvline(x=change, ymin=0, ymax=1, linewidth=2, color='k')  # ymin/ymax are fractions of the axes height
plt.ylim(0,15)
plt.xlim(0,365)
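
The detection rule used above (largest gap between a short and a long trailing mean) can be illustrated on a noise-free step signal with plain Python. This is a sketch under simplifying assumptions: the function names are hypothetical and the signal is synthetic.

```python
def rolling_mean(values, window):
    """Trailing mean over `window` points; None until a full window is available."""
    out = [None] * (window - 1)
    for i in range(window - 1, len(values)):
        out.append(sum(values[i - window + 1:i + 1]) / window)
    return out


def change_point(values, short=3, long=20):
    """Index where the short and long trailing means differ the most."""
    fast = rolling_mean(values, short)
    slow = rolling_mean(values, long)
    gaps = [abs(f - s) if f is not None and s is not None else -1.0
            for f, s in zip(fast, slow)]
    return gaps.index(max(gaps))


signal = [0.0] * 50 + [5.0] * 50   # a jump of size 5 at index 50
print(change_point(signal))
```

Because both windows look backwards, the reported index trails the true jump by a couple of samples (here by `short - 1`), which is worth keeping in mind when reading the vertical line in the plot.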