## Setting environment variables:

nano ~/.bashrc  
export VARNAME="var end point"  
alias alias_name="alias command"   
source ~/.bashrc  
echo $VARNAME

## To download a file from GitHub

1) You can **clone** or **fork** my repository:

```
git clone REPO_PATH
```

**OR** do it **manually** as follows:

1) On the terminal go to the directory you want to download in.

2) **Click on Raw** next to the notebook. Now you have 2 choices:  
    you can **copy and paste** the raw IPython notebook (which is a JSON file) into a new file on your own machine (name the file FILENAME.ipynb),    
    or you can use the **wget** command on the terminal. Typing 
```
wget PATH
```
will save a copy of the notebook in the directory where you were when you typed the command. **wget** (short for "web get") downloads files, or even entire directories, from a web URL.
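
If you would rather stay inside Python, the same download can be sketched with the standard library. This is only a minimal sketch: the helper name `download` and the placeholder URL in the comment are made up for illustration, and it assumes the URL ends in the file name you want to keep.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def download(url, dest_dir="."):
    """Save the file at `url` into `dest_dir`, keeping its file name (like wget)."""
    # Hypothetical helper: the file name is taken from the last part of the URL path.
    filename = os.path.basename(urlparse(url).path)
    dest = os.path.join(dest_dir, filename)
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Usage (placeholder URL, replace with the Raw link of the notebook):
# download("https://raw.githubusercontent.com/USER/REPO/master/FILENAME.ipynb")
```

Like `wget`, this writes into the current directory unless you pass a different `dest_dir`.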


## steps to create and manage a git repo with github

https://github.com/fedhere/PUI2018_fb55/blob/master/Lab1_fb55/githubCreateRepoCmds.md

## Remove sensitive data from GitHub repos permanently
https://help.github.com/articles/removing-sensitive-data-from-a-repository/

In [None]:
cd YOUR-REPOSITORY

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA' \
--prune-empty --tag-name-filter cat -- --all

git push origin --force --all

git push origin --force --tags

## Useful packages

In [29]:
from __future__ import print_function  # __future__ imports must come before any other statement
__author__ = 'Linda Jaber, CUSP NYU 2018'

import sys
import os
import scipy.stats


try:
    import urllib2 as urllib
except ImportError:
    import urllib.request as urllib
    
    
import pandas as pd


import matplotlib.pyplot as plt
import pylab as pl
%pylab inline


import requests
import json
import io

import geopandas as gpd
import shapely
from fiona.crs import from_epsg
import pysal as ps

import scipy as sp
from scipy import stats

pylab.rcParams['figure.figsize'] = 12, 8
pylab.rcParams['figure.dpi'] = 100

DEVELOPING = False
#if DEVELOPING:
    #cb2015 = cb201501[::1000]
#else:
    #cb2015 = pd.concat([cb201501, cb201506])

from getCitiBikeCSV import getCitiBikeCSV # must have this function locally or in path
from get_jsonparsed_data import get_jsonparsed_data # must have this function locally or in path


Populating the interactive namespace from numpy and matplotlib


In [None]:
plt.rcParams

## show an image inside a notebook

In [None]:
from IPython.display import Image
Image(filename='../plotsforclasses/NYCReentryprogram_title.png')

## Setting environment variable PUIDATA

In [22]:
puidata = os.getenv('PUIDATA')
if puidata is None:
    os.environ['PUIDATA']='%s/PUIdata'%os.getenv('HOME')
    puidata = os.getenv('PUIDATA')
print('puidata: ', puidata) 

puidata:  /nfshome/lj1232/PUIdata
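
The cell above sets the variable but does not check that the directory exists. A small sketch of a helper that does both follows; the function name `data_dir` and its arguments are my own invention, not part of any library.

```python
import os


def data_dir(var="PUIDATA", default_subdir="PUIdata"):
    """Return the directory stored in the environment variable `var`.

    If the variable is unset, fall back to $HOME/<default_subdir> and set it
    for the current process. The directory is created if it is missing.
    """
    path = os.environ.get(var)
    if path is None:
        path = os.path.join(os.path.expanduser("~"), default_subdir)
        os.environ[var] = path
    os.makedirs(path, exist_ok=True)
    return path


# Usage: puidata = data_dir()
```

Note that `os.environ` changes only last for the running Python process; to make the variable permanent, export it in `~/.bashrc`.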


## Downloading zipped data and unpacking it into PUIdata

In [None]:
# downloading a zipped file
!curl -O <LINK>
# unpacking into $PUIDATA
!unzip <FILENAME.zip> -d $PUIDATA

# if it is not a zip file download using curl and then move
!curl <LINK> > <FILENAME>
!mv <FILENAME> $PUIDATA

# list files in PUIdata
!ls $PUIDATA
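
The `unzip ... -d $PUIDATA` step can also be done without shelling out. A minimal sketch with the standard `zipfile` module, where `unzip_to` is a hypothetical helper name:

```python
import os
import zipfile


def unzip_to(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir` and return their names."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage (placeholders): unzip_to("FILENAME.zip", os.getenv("PUIDATA"))
```

Returning the member names gives the same information as running `ls` on the directory right after unpacking.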

## CSV

In [None]:
# read from a file in PUIdata
df = pd.read_csv(puidata + '/FILENAME')

# OR

url = 'PATH'

# read directly from a URL (use the raw data link if the file is on GitHub)
df = pd.read_csv(url)

# use an API to read a JSON file into pandas
df = pd.read_json(url)

## Important pandas methods

In [None]:
df.head()
df.tail()
df.columns  # an attribute, not a method
# count how many rows there are; those are the data points
len(df)     # df.size is also an attribute, but it counts rows * columns

# select 2 columns only
df_2 = df[['col1', 'col2']]

# rename
df.rename(columns = {'date_of_census':'Date of Census', 
               'total_children_in_shelter':'Total Children in Shelter', 
               'adult_families_in_shelter': 'Adult Families in Shelter'}, inplace=True)
# OR
df.columns = [<LIST OF ALL COLUMNS NAMES>]

# change into a datetime format
df['Date of Census'] = pd.to_datetime(df['Date of Census'])



## Json

In [None]:
# use an API to retrieve data as JSON
url = 'https://data.cityofnewyork.us/resource/wece-v9d7.json'
response = urllib.urlopen(url)
data = response.read().decode('utf-8')
data = json.loads(data)

url = 'https://api.census.gov/data/2016/acs/acs1/variables.json'
response = requests.request('GET', url)
aff1y = json.loads(response.text)

# use an API to read a JSON file into pandas as a dataframe
df = pd.read_json(url)

# local  file
with open('VM.json') as f:
    data = json.load(f)
data.keys()

s = json.load(open(puidata + "/FILENAME") )

## Plotting

Every plot needs a caption that explains to the reader:   
1) WHAT: what the reader is looking at   
2) WHY: why the reader is looking at it at this point in the analysis   
3) TAKE HOME: what the take-home point of the plot is for the analysis 

In [None]:
fig, ax = pl.subplots(figsize=(8,8))
for i in range(50):
    ax.plot(ReprRandAll[i,0], ReprRandAll[i,1], '.', alpha=0.3)
# label the axes and fix the aspect ratio once, outside the loop
ax.set_xlabel("x value", fontsize=20)
ax.set_ylabel("y value", fontsize=20)
ax.set_aspect("equal")
    
# if the axis label should not be shown at all
ax.set_xlabel('')

pl.legend()
ax.xaxis.set_ticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])



## Statistics

### Z test:
Tests a difference between means. 


$Z = \frac{|M - \mu|}{\frac{\sigma}{\sqrt{N}}}$

In [None]:
z = np.abs(oldMean - newMean) / (oldStd / np.sqrt(N))
print ("Z statistics\nZ = {0:.2f}".format(z))

This is in units of standard deviations (sort of)! 
    - a p-value of 0.05 corresponds to about 2 standard deviations
    - 2.56 > 2, so I am farther than 2 standard deviations from the mean. 
(in reality, for a 2-tailed test the threshold for 0.05 significance is 1.96, not 2)
    
**We reject the null hypothesis, p-value *p* < 0.05**    
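
As a sanity check on the Z test recipe, here is a self-contained sketch that also turns the Z statistic into a two-tailed p-value instead of eyeballing "about 2 standard deviations". All the numbers (`old_mean`, `old_std`, `new_mean`, `n`) are made up for illustration; `statistics.NormalDist` needs Python 3.8 or later.

```python
import math
from statistics import NormalDist

# Hypothetical inputs, chosen only to illustrate the formula.
old_mean = 100.0   # population mean (mu)
old_std = 15.0     # population standard deviation (sigma)
new_mean = 103.0   # sample mean (M)
n = 100            # sample size (N)

# Z = |M - mu| / (sigma / sqrt(N))
z = abs(new_mean - old_mean) / (old_std / math.sqrt(n))

# Two-tailed p-value: chance of landing at least this far from the mean
# on either side, if the null hypothesis is true.
p_value = 2 * (1 - NormalDist().cdf(z))

print("Z = {0:.2f}, p-value = {1:.4f}".format(z, p_value))
```

With a two-tailed test the p-value drops below 0.05 exactly when Z exceeds 1.96, which is where the "1.96, not 2" remark comes from.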

## t-test

For one sample and an assumed population mean:

$t = \frac{M - \mu}{\frac{\sigma}{\sqrt{N}}}$

For two samples (unpaired test):

$t = \frac{M_0 - M_1} {\sqrt {s^2 ({ \frac{1} {n_0} + \frac{1} {n_1}})} }$
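
A small sketch of the two-sample formula, under the assumption that $s^2$ is the pooled sample variance (the notes do not define it, so treat that as my reading). The function name `pooled_t` and the two toy samples are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(sample_0, sample_1):
    """Unpaired two-sample t statistic, assuming s^2 is the pooled variance."""
    n0, n1 = len(sample_0), len(sample_1)
    # Pooled variance: the sample variances weighted by their degrees of freedom.
    s2 = ((n0 - 1) * variance(sample_0) + (n1 - 1) * variance(sample_1)) / (n0 + n1 - 2)
    return (mean(sample_0) - mean(sample_1)) / math.sqrt(s2 * (1 / n0 + 1 / n1))


# Toy data, made up for illustration.
t_stat = pooled_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print("t = {0:.2f}".format(t_stat))
```

The statistic is then compared to a t distribution with `n0 + n1 - 2` degrees of freedom.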

### Hypothesis Test: Difference Between Proportions:
## Z-test
We want to conduct a hypothesis test to determine if there is a significant difference between the two proportions. First we will conduct a one-tailed **two-proportion z-test**. We will use a binomial distribution since it is a yes/no test.<br>
Reference: https://stattrek.com/hypothesis-test/difference-in-proportions.aspx?Tutorial=AP

pooled sample proportion:   p = $\frac{p_0 n_0  + p_1 n_1}{n_0 + n_1}$

standard error:  SE = $\sqrt{ \frac{ p(1 - p)} {n_0} + \frac{ p(1 - p)} {n_1} }$ 

z-statistic:  z = $\frac{p_0 - p_1} {SE}$

### Null Hypothesis:
The proportion of prisoners convicted of a felony after being released is the same or larger for individuals who participated in the CEO program compared to those in the control group at **significance level** $\alpha$ = 0.05

**Control Group:** $P_0$ = 11.7% <br>
**Program Group:** $P_1$ = 10.0%

$H_0: P_0 - P_1 \leq 0$  
$H_a: P_0 - P_1 > 0$ <br>
$\alpha$ = .05

In [None]:
# set lambda functions for p, SE and z
p = lambda p0, p1, n0, n1: (p0 * n0 + p1 * n1) / (n0 + n1)
SE = lambda p, n0, n1: np.sqrt(p * (1 - p) * (1.0 / n0 + 1.0 / n1))  # 1.0 avoids integer division in Python 2
z = lambda p0, p1, SE: (p0 - p1) / SE

# set new sample proportions
# note all the other values are still the same

p0 = 11.7 * 0.01
p1 = 10.0 * 0.01

if p0 - p1 <= 0:
    print ('The Null holds.')
else:
    print ('We must assess the statistical significance.')

In [None]:
# calculate the z-score
z1 = z(p0, p1, SE(p(p0, p1, n0, n1), n0, n1))
print ('z-statistic = {0:.2f}'.format(z1))

**table**

In [None]:
Image('http://intersci.ss.uci.edu/wiki/images/3/3a/Normal01.jpg')

In [None]:
# use the z-table to calculate the p-value
P = 0.8023
p_value = 1 - P
alpha = 0.05

# interpret the results
print ('p_value = %.2f'%p_value)
def result(a,p):
    print ('Is the p-value = {0:.2f} less than the significance level = {1:.2f}?'\
           .format(p,a))
    if p < a:
        print ('YES!')
    else: 
        print ('NO!')
    print ('Then the Null hypothesis {}'\
           .format('is rejected.' if p < a else 'holds.'))
    
result(alpha, p_value)
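
The table lookup can be cross-checked numerically. The sketch below recomputes the whole one-tailed two-proportion test with only the standard library; the proportions come from the text and the group sizes (409 and 568) from the contingency table in these notes. The variable names are mine, and `statistics.NormalDist` needs Python 3.8 or later.

```python
import math
from statistics import NormalDist

p0, n0 = 0.117, 409   # control group: proportion convicted, group size
p1, n1 = 0.100, 568   # program group: proportion convicted, group size
alpha = 0.05

# Pooled proportion, standard error and z statistic, as in the formulas above.
pooled = (p0 * n0 + p1 * n1) / (n0 + n1)
se = math.sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
z_stat = (p0 - p1) / se

# One-tailed p-value: probability of a z at least this large under the null.
p_value = 1 - NormalDist().cdf(z_stat)

print("z = {0:.2f}, p-value = {1:.3f}".format(z_stat, p_value))
print("reject the null" if p_value < alpha else "fail to reject the null")
```

The p-value agrees with the `1 - 0.8023` read off the z-table to within rounding.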

## chi square test
After that we will conduct a **Chi Squared test**, and use the $\chi^2$ distribution. A chi square ($\chi^2$) statistic is used to investigate whether distributions of categorical variables differ from one another. We calculate the $\chi^2$ statistic, compare it to the chi square distribution, and see how far in the tail it is. <br>
The chi square statistic compares the counts (actual numbers and not percentages, proportions or means, etc.) of categorical responses between two or more independent groups. <br>
Reference: http://math.hws.edu/javamath/ryan/ChiSquare.html

chi squared statistic: $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2} {E_{ij}}$

expected value: $E_{ij} = \frac {(i\text{th row total}) \times (j\text{th column total})} {\text{grand total}}$

degrees of freedom: df = (r-1)(c-1)

where **E** is expected, **O** is observed, **r** is row, **c** is column, **ij** is the row and column number of a cell.
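
To see the three formulas working together on a table of any size, here is a plain-Python sketch. The function name `chi_square` and the 2x3 table of counts are hypothetical, picked so the arithmetic is easy to follow by hand.

```python
def chi_square(observed):
    """Return (chi square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # E_ij = (ith row total * jth column total) / grand total
            e_ij = row_totals[i] * col_totals[j] / grand_total
            stat += (o_ij - e_ij) ** 2 / e_ij

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Made-up counts: 2 groups (rows) by 3 response categories (columns).
table = [[10, 20, 30],
         [20, 20, 20]]
stat, dof = chi_square(table)
print("chi square = {0:.3f}, degrees of freedom = {1}".format(stat, dof))
```

Here every expected count is 15, 20 or 25, so the statistic is 2 * (25/15 + 0 + 25/25).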

### Chi Square test
### Null Hypothesis:
$H_0$: felony conviction and participation in a CEO  program are independent <br>
$H_a$: felony conviction and participation in a CEO  program are not independent <br>
$\alpha$ = .05
#### Construct a Contingency Table

| Convicted of a felony |    YES         |   NO           |    Total       |
|-----------------------|:---------------|:---------------|:---------------|
| Control Group         |   0.117 * 409  |   0.883 * 409  |     409        |
| Program Group         |   0.1 * 568    |   0.9 * 568    |     568        |
| Total                 |  104.653       |  872.347       |     977        |

**Manually**

In [None]:
# observed values
observed = np.array([[ 0.117 * 409, 0.883 * 409 ], [ 0.1 * 568, 0.9 * 568 ] ])

def chisq(o):
    if o.shape != (2, 2):
        print ("must pass a 2x2 array")
        return -1
    E = np.empty_like(o)
    for j in range(len(o[0])):
        for i in range(2):
            
            E[i][j] = ((o[i,:].sum() * o[:,j].sum()) / 
                        (o).sum())
    return ((o - E)**2 / E).sum()

print ('Chi Square Statistic = {}'.format(chisq(observed)))

# degrees of freedom ((r-1)(c-1))
df = (len(observed) - 1) * (len(observed[0,:]) - 1)   

print ('Degrees of Freedom = {}'.format(df))

**TABLE**

In [None]:
Image("http://passel.unl.edu/Image/Namuth-CovertDeana956176274/chi-sqaure%20distribution%20table.PNG")

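**using scipy**

In [None]:
# Another way to calculate it:
# scipy.stats.chi2_contingency returns the chi square statistic, the p-value, 
# the degrees of freedom and the expected values.
# Note: for a 2x2 table it applies Yates' continuity correction by default;
# pass correction=False to reproduce the manual calculation.

chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)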

print('Chi Square Statistic: {}\np-value: {}\nDegrees of Freedom: {}\n'.format( chi2, p, dof ) )
print( 'observed = {}\nexpected = {}'.format(observed, expected) )

**CONCLUSION**

In [None]:
chisq =  0.718
dof = 1

# at an alpha level of 0.05 and with one degree of freedom,
# the critical value we get from the table is 3.84
critical_value = 3.84

def result(x, c):
    print ('Is the chi square statistic = ' + 
           '{0:.3f} bigger than the critical value = {1:.2f}?'.format(x,c))
           
    if x > c:
        print ('YES!')
    else:
        print ('NO!')
    
    print ('Then the Null hypothesis {}'.format(
        'is rejected.' if x > c  else 'holds.') )

    
result(chisq, critical_value)
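
Instead of comparing against the tabulated critical value, the p-value can be computed directly when there is exactly one degree of freedom. In that case the $\chi^2$ variable is the square of a standard normal one, so the tail probability reduces to the complementary error function. The helper name `chi2_pvalue_1dof` is made up; the sketch does not cover other degrees of freedom.

```python
import math


def chi2_pvalue_1dof(x):
    """P(chi square >= x) for exactly 1 degree of freedom."""
    # chi2 with 1 dof is Z**2, so P(Z**2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


alpha = 0.05
p_value = chi2_pvalue_1dof(0.718)   # statistic from the manual calculation

print("p-value = {0:.3f}".format(p_value))
print("reject the null" if p_value < alpha else "fail to reject the null")
```

Plugging in the critical value 3.84 gives a p-value of about 0.05, which is a quick way to confirm the table entry.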

IDEA:
Women are less likely than men to choose biking _for commuting_

NULL HYPOTHESIS:
The proportion of men biking on weekends is _the same_ or _higher_  than the proportion of women biking on weekends  

_$H_0$_ : $\frac{W_{\mathrm{weekend}}}{W_{\mathrm{total}}} \leq \frac{M_{\mathrm{weekend}}}{M_{\mathrm{total}}}$  

_$H_1$_ : $\frac{W_{\mathrm{weekend}}}{W_{\mathrm{total}}} > \frac{M_{\mathrm{weekend}}}{M_{\mathrm{total}}}$  

or identically:  

_$H_0$_ : $\frac{W_{\mathrm{weekend}}}{W_{\mathrm{total}}} - \frac{M_{\mathrm{weekend}}}{M_{\mathrm{total}}} \leq 0 $   

_$H_1$_ : $\frac{W_{\mathrm{weekend}}}{W_{\mathrm{total}}} - \frac{M_{\mathrm{weekend}}}{M_{\mathrm{total}}} > 0$  

I will use a significance level  $\alpha=0.05$,  

which means I want the probability of getting a result at least as extreme as mine, if the null hypothesis is true, to be less than 5%.