# NumPy

In [148]:
import numpy as np
import pandas as pd

1. The most basic kind of broadcast is with a scalar: when you perform a binary operation (e.g., add, multiply, ...) on an array and a scalar, the operation is applied with that scalar to every element of the array. To try this out, create a vector 1, 2, ..., 10 by adding 1 to the result of the arange function.

In [129]:
a = np.arange(10) + 1  # scalar broadcast: 1 is added to each of 0, 1, ..., 9
a

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [130]:
print(a + 1)

[ 2  3  4  5  6  7  8  9 10 11]


2. Now, create a 10 × 10 matrix A in which A[i][j] = i + j. You’ll be able to do this using the vector you just created, and adding it to a reshaped version of itself.  

In [131]:
ls = []
for i in range(0,10):
    for j in range(0,10):
        ls.append(i+j)
a = np.array(ls).reshape(10,10)
a

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12],
       [ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13],
       [ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14],
       [ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15],
       [ 7,  8,  9, 10, 11, 12, 13, 14, 15, 16],
       [ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17],
       [ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]])
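The same matrix can be built without Python loops by adding a vector to a reshaped copy of itself. The following is a minimal sketch, assuming a zero-based index vector so that the result matches the output above; the names `idx`, `col` and `A` are illustrative and do not come from the notebook.

```python
import numpy as np

idx = np.arange(10)          # shape (10,): 0, 1, ..., 9
col = idx.reshape(10, 1)     # shape (10, 1): the same values as a column

# Broadcasting stretches (10, 1) and (10,) to a common (10, 10) shape,
# so A[i, j] = idx[i] + idx[j] = i + j.
A = col + idx

print(A.shape)
print(A[0, 0], A[9, 9])
```

Starting from the vector 1, 2, ..., 10 instead gives the same pattern shifted by 2, with entries running from 2 to 20.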

3. A very common use of broadcasting is to standardize data, i.e., to make it have zero mean and unit variance. <br>
a. First, create a fake “data set” with 50 examples, each with five dimensions: <br>
`import numpy.random as npr` <br>
`data = np.exp(npr.randn(50, 5))` <br>
b. Don’t worry too much about what this code is doing at this stage of the course, but for completeness: it imports the NumPy random number generation library, then generates a 50 × 5 matrix of standard normal random variates and exponentiates them. The effect is a pretend data set of 50 independent and identically distributed vectors from a log-normal distribution.

In [132]:
import numpy.random as npr 
data = np.exp ( npr.randn(50, 5) )

In [133]:
data

array([[ 1.2917693 ,  0.97901   ,  1.85048452,  3.62131799,  0.38323013],
       [ 0.39385277,  0.10977465,  1.86845461,  1.23039769,  1.63990753],
       [ 4.79260597,  2.06352855,  0.29548909,  1.56748935,  1.2265444 ],
       [ 1.69045834,  1.10094413,  1.97442723,  0.28685894,  1.62218916],
       [ 0.82823791,  0.73568959,  1.57194114,  0.87621753,  1.51753947],
       [ 0.91150414,  0.19212883,  0.70277808,  0.08639028,  1.53615825],
       [ 0.69025388,  7.75687076,  4.04478547,  1.20007783,  1.01696495],
       [ 1.53815357,  1.50927691,  0.94250976,  0.58603381,  2.66222134],
       [ 2.66033079,  1.48336649,  1.09751214,  1.62198952,  2.87581942],
       [ 0.29982571,  5.57454374,  1.02960268,  1.48079991,  2.12075265],
       [ 1.56276999,  4.47283328,  0.94827593,  0.74713877,  0.27078834],
       [ 0.52486407,  0.52166507,  0.18065187,  2.44248965,  1.05925635],
       [ 1.60629064,  0.99540079,  3.30441722,  1.52017119,  0.33441397],
       [ 0.98781853,  0.38379103,  0.6

4. Now, compute the mean and standard deviation of each column. This should result in two vectors of length 5. You’ll need to think a little bit about how to use the axis argument to mean and std. Store these vectors into variables and print both of them.

In [134]:
# axis=0 collapses the 50 rows, leaving one value per column
mean = np.mean(data, axis=0)
std = np.std(data, axis=0)
print(mean)
print(std)

[1.52402466 1.43150524 1.56223437 1.65916806 1.5095779 ]
[1.36121598 1.57068716 2.0075696  1.626791   1.49290982]
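The `axis` argument names the dimension that gets collapsed, which is easy to get backwards. The sketch below is a minimal illustration on a stand-in matrix; the seed and the names `demo`, `col_mean`, `row_mean` and `loop_mean` are assumptions for this example, so its numbers are unrelated to the output above.

```python
import numpy as np

rng = np.random.default_rng(0)               # arbitrary seed, for repeatability
demo = np.exp(rng.standard_normal((50, 5)))  # stand-in for `data`: 50 rows, 5 columns

col_mean = demo.mean(axis=0)   # collapse the 50 rows   -> one value per column
row_mean = demo.mean(axis=1)   # collapse the 5 columns -> one value per row
col_std = demo.std(axis=0)     # population std (ddof=0), NumPy's default

print(col_mean.shape, row_mean.shape)

# axis=0 gives the same numbers as taking the mean of each column in turn
loop_mean = np.array([demo[:, j].mean() for j in range(5)])
print(np.allclose(col_mean, loop_mean))
```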


In [135]:
arr = np.concatenate([mean,std])
final_arr = np.reshape(arr,(2,5))
print(final_arr)
print(final_arr.ndim)

[[1.52402466 1.43150524 1.56223437 1.65916806 1.5095779 ]
 [1.36121598 1.57068716 2.0075696  1.626791   1.49290982]]
2


5. Now standardize the data matrix by <br>1) subtracting the mean of each column <br>2) dividing each column by its standard deviation. Do this via broadcasting, and store the result in a matrix called normalized. To verify that you successfully did it, compute the mean and standard deviation of the columns of normalized and print them out.

In [136]:
# Broadcasting: the (5,) vector of column means is subtracted from every row of the (50, 5) matrix
mean_sub = data - mean

In [137]:
mean_sub

array([[-2.32255355e-01, -4.52495243e-01,  2.88250148e-01,
         1.96214993e+00, -1.12634777e+00],
       [-1.13017188e+00, -1.32173060e+00,  3.06220240e-01,
        -4.28770373e-01,  1.30329626e-01],
       [ 3.26858131e+00,  6.32023303e-01, -1.26674527e+00,
        -9.16787166e-02, -2.83033498e-01],
       [ 1.66433684e-01, -3.30561115e-01,  4.12192864e-01,
        -1.37230912e+00,  1.12611260e-01],
       [-6.95786745e-01, -6.95815653e-01,  9.70677309e-03,
        -7.82950538e-01,  7.96156980e-03],
       [-6.12520524e-01, -1.23937642e+00, -8.59456292e-01,
        -1.57277778e+00,  2.65803502e-02],
       [-8.33770778e-01,  6.32536551e+00,  2.48255110e+00,
        -4.59090231e-01, -4.92612952e-01],
       [ 1.41289066e-02,  7.77716703e-02, -6.19724604e-01,
        -1.07313426e+00,  1.15264344e+00],
       [ 1.13630613e+00,  5.18612432e-02, -4.64722226e-01,
        -3.71785391e-02,  1.36624152e+00],
       [-1.22419895e+00,  4.14303850e+00, -5.32631684e-01,
        -1.78368154e-01

In [138]:
mean_sub.shape

(50, 5)

In [139]:
# Broadcasting: each column of data is divided by that column's std.
# This rescales the raw data, not mean_sub, so the columns are not centered.
normalized = data / std

In [140]:
normalized

array([[0.94898188, 0.62330044, 0.92175361, 2.22604993, 0.25670012],
       [0.28933893, 0.06988957, 0.93070477, 0.75633421, 1.09846389],
       [3.520827  , 1.31377438, 0.14718747, 0.96354685, 0.8215797 ],
       [1.24187371, 0.70093152, 0.9834913 , 0.17633423, 1.08659555],
       [0.60845445, 0.46838709, 0.78300705, 0.53861715, 1.01649775],
       [0.66962492, 0.12232151, 0.35006412, 0.05310472, 1.02896922],
       [0.50708623, 4.93852052, 2.01476725, 0.73769638, 0.68119651],
       [1.12998495, 0.96090231, 0.469478  , 0.36023915, 1.78324324],
       [1.95437817, 0.94440607, 0.54668697, 0.9970485 , 1.92631825],
       [0.22026314, 3.54911143, 0.51286027, 0.91025824, 1.42054974],
       [1.14806909, 2.84769202, 0.47235022, 0.45927152, 0.18138292],
       [0.38558471, 0.33212538, 0.08998536, 1.50141576, 0.70952467],
       [1.18004098, 0.63373587, 1.64597891, 0.93446005, 0.22400145],
       [0.72568832, 0.24434594, 0.32456972, 0.41840177, 0.59693279],
       [0.38795304, 0.12505047, 0.

In [141]:
normalized.shape

(50, 5)

In [142]:
# Column-wise mean and standard deviation of the result
mean = np.mean(normalized, axis=0)
std = np.std(normalized, axis=0)
print(mean)
print(std)

[1.11960533 0.91138788 0.77817196 1.01990241 1.01116483]
[1. 1. 1. 1. 1.]
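The means printed above sit near 1 rather than 0, which is what results when the uncentered `data` is divided by `std`: the spread is standardized but the columns are not centered. A minimal sketch that performs both steps with broadcasting follows. It runs on a freshly generated stand-in matrix; the seed and the names `demo`, `mu` and `sigma` are assumptions for illustration, so the values will not match the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)               # arbitrary seed, for repeatability
demo = np.exp(rng.standard_normal((50, 5)))  # stand-in for `data`

mu = demo.mean(axis=0)      # shape (5,): one mean per column
sigma = demo.std(axis=0)    # shape (5,): one std per column

# (50, 5) - (5,) and (50, 5) / (5,) both broadcast across the rows,
# so every column is centered first and then scaled.
normalized = (demo - mu) / sigma

print(np.allclose(normalized.mean(axis=0), 0.0))
print(np.allclose(normalized.std(axis=0), 1.0))
```

The column means come out as tiny floating-point residues rather than exact zeros, which is why the check uses `np.allclose`.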


# Statistics

### Chi Square Test

The table below shows survey responses cross-classified by two categorical variables: age group (18–29, 30–44, 45–64, and 65 or older) and preferred movie genre (“Action/Adventure”, “Romance”, or “Biography”). Is there any evidence of a relationship between age group and movie genre inclination at the 5% significance level?

In [145]:
data = {
    'Age_Group' : ['18-29','30-44','45-64','65&Older'],
    'Action/Adventure' : [141,179,220,85],
    'Romance' : [68,159,216,101],
    'Biography' : [4,7,4,4]
}

In [151]:
df = pd.DataFrame(data)
df

| | Age_Group | Action/Adventure | Romance | Biography |
|---|---|---|---|---|
| 0 | 18-29 | 141 | 68 | 4 |
| 1 | 30-44 | 179 | 159 | 7 |
| 2 | 45-64 | 220 | 216 | 4 |
| 3 | 65&Older | 85 | 101 | 4 |


#### H0 - There is no relationship between age group and genre inclination (the two are independent)

#### H1 - There is a relationship between age group and genre inclination

In [160]:
# df already holds the counts, so the contingency table is df indexed by age group.
# pd.crosstab is meant for raw data with one row per respondent.
ct = df.set_index('Age_Group')

In [161]:
ct

| Age_Group | Action/Adventure | Romance | Biography |
|---|---|---|---|
| 18-29 | 141 | 68 | 4 |
| 30-44 | 179 | 159 | 7 |
| 45-64 | 220 | 216 | 4 |
| 65&Older | 85 | 101 | 4 |
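A Pearson chi-square test of independence works directly on the table of observed counts. The sketch below computes the statistic with NumPy alone, using the survey counts given earlier. The names `observed`, `expected`, `chi2_stat` and `critical` are illustrative, and the critical value is taken from a standard chi-square table instead of being computed.

```python
import numpy as np

# Observed counts: rows are age groups (18-29, 30-44, 45-64, 65 & older),
# columns are genres (Action/Adventure, Romance, Biography).
observed = np.array([
    [141,  68, 4],
    [179, 159, 7],
    [220, 216, 4],
    [ 85, 101, 4],
])

row_tot = observed.sum(axis=1, keepdims=True)   # shape (4, 1)
col_tot = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

# Expected counts under independence; (4, 1) * (1, 3) broadcasts to (4, 3).
expected = row_tot * col_tot / n

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Assumption: 5% critical value for 6 degrees of freedom, from a standard table.
critical = 12.592

print(f"chi2 = {chi2_stat:.3f}, dof = {dof}")
print("reject independence" if chi2_stat > critical else "fail to reject independence")
```

With these counts the statistic exceeds the critical value, so the sketch reports a rejection of independence at the 5% level. If SciPy is available, `scipy.stats.chi2_contingency(observed)` returns the statistic, p-value, degrees of freedom and expected counts in one call. Two expected counts in the Biography column fall below 5, so the chi-square approximation should be read with some caution.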
