In this project stage you will analyze the integrated and cleaned table to derive insights. The analysis is of your own choosing, but it must involve one of the key techniques that we will cover in class: classification, clustering, correlation discovery, anomaly detection, or OLAP-style exploration. I will discuss this further in class.

What to submit

Submit the following on your group's website: 
- a CSV file storing Table E, the integrated table which is the output of project stage 4. 
- a pdf file that discusses the following issues: 
    - Statistics on Table E: specifically, what is the schema of Table E, how many tuples are in Table E? Give at least four sample tuples from Table E. 
    - What was the data analysis task that you wanted to do? (Example: we wanted to know if we can use the rest of the attributes to accurately predict the value of the attribute loan_repaid.) For that task, describe in detail the data analysis process that you went through. 
    - Give any accuracy numbers that you have obtained (such as precision and recall for your classification scheme). 
    - What did you learn/conclude from your data analysis? Were there any problems with the analysis process and with the data? 
    - If you had more time, what would you propose to do next? 
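
For the accuracy numbers requested above, precision and recall can be computed directly from the true and predicted labels. The sketch below is only an illustration: the helper name `precision_recall` and the toy label lists are made up for the example (they are not project data), and the positive class is assumed to be `1`, as in a binary attribute like `loan_repaid`.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for one positive class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)  # true positives
    fp = sum(1 for t, p in pairs if p == positive and t != positive)  # false positives
    fn = sum(1 for t, p in pairs if p != positive and t == positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, invented for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print("precision:", precision, "recall:", recall)
```

Precision is the fraction of predicted positives that are correct; recall is the fraction of actual positives that were found.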


Visualization: pyplot

Computation: NumPy, SciPy

scikit-learn


#  Stage 5: Data Analysis #

#### Trang Ho, Thomas Ngo, Qinyuan Sun



## Dataset
In this project stage, we analyze table E, the result of merging the two tables AOM and IPEDS. The CSV file storing table E can be found here (NOTE: Put link here)

NOTE: put some info about the merged table, such as these data from year 2006..2014

Below is the schema of table E:

| Attribute name | Description                                                                   |
| -------------- |:-----------------------------------------------------------------------------:|
| year           | The year the individual attended the conference                               |
| pid            | The individual ID                                                             |
| ipeds_aid      | The individual's affiliation ID                                               |
| ipeds_name     | The individual's affiliation name                                             |
| ipeds_alias    | The individual's affiliation alias                                            |
| ipeds_city     | The individual's affiliation city                                             |
| ipeds_prov     | The individual's affiliation province                                         |
| ipeds_web      | The individual's affiliation website                                          |
| GROFFER        | The individual's affiliation graduate offering                                |
| CCSIZSET       | The individual's affiliation Size and Setting by Carnegie Classification 2010 |
| INSTSIZE       | The individual's affiliation institution size category                        |
| CBSATYPE       | The individual's affiliation CBSA Type Metropolitan or Micropolitan           |

NOTE: The possible categorical values of GROFFER, CCSIZSET, INSTSIZE, CBSATYPE can be found here (NOTE: Put link here)

There are a total of 35585 tuples in table E. Below are 5 sample tuples from E.

In [2]:
import os
import pandas as pd

working_dir = os.path.dirname(os.getcwd())
path_to_csv_dir = os.path.join(working_dir, 'data')
data = pd.read_csv(os.path.join(path_to_csv_dir, '_aom_mapped_v2.csv'))
data_school = pd.read_csv(os.path.join(path_to_csv_dir, '_aom_filtered_school.csv'))

data.head(5)

data_school.head(5)

Unnamed: 0,ipeds_aid,ipeds_name,ipeds_alias,ipeds_city,ipeds_prov,ipeds_web,GROFFER,CCSIZSET,INSTSIZE,CBSATYPE
0,100663,university of alabama at birmingham,0,birmingham,alabama,www.uab.edu,1,15,4,1
1,100706,university of alabama in huntsville,uah |university of alabama huntsville,huntsville,alabama,www.uah.edu,1,12,3,1
2,100751,the university of alabama,0,tuscaloosa,alabama,www.ua.edu/,1,16,5,1
3,100830,auburn university at montgomery,aum|auburn university montgomery|auburn univer...,montgomery,alabama,www.aum.edu,1,12,3,1
4,100858,auburn university,0,auburn university,alabama,www.auburn.edu,1,15,5,1


## Analysis

In [3]:
import numpy as np
import scipy, sklearn, matplotlib
import matplotlib.pyplot as plt

### Trends in the number of attendees for the affiliations with the most attendees, 2006-2014

In [4]:
# Split the data by year, then count the unique attendees per affiliation
data_by_year = {}
attendees_by_year = {}
most_attendees_schools_by_year = {}
for year in range(2006, 2015):
    data_by_year[year] = data[data['year'] == year]
    attendees_by_year[year] = data_by_year[year].groupby("ipeds_aid").agg({"pid": pd.Series.nunique}).reset_index()
    # Affiliations sorted by attendee count, largest first
    most_attendees_schools_by_year[year] = attendees_by_year[year].sort_values('pid', ascending=False)
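
The per-year counts above can also be arranged as a single year-by-affiliation table, which is a convenient shape for plotting trends. The following is a minimal sketch under stated assumptions: the `toy` frame is an invented stand-in for table E that reuses only the column names `year`, `pid`, and `ipeds_aid`, and the names `counts`, `trend`, and `top` are illustrative.

```python
import pandas as pd

# Toy stand-in for table E (hypothetical values, not real data)
toy = pd.DataFrame({
    "year":      [2006, 2006, 2006, 2007, 2007, 2007, 2007],
    "pid":       [1, 2, 2, 1, 3, 4, 4],
    "ipeds_aid": [100, 100, 100, 100, 200, 200, 200],
})

# Unique attendees per (year, affiliation) in one pass
counts = toy.groupby(["year", "ipeds_aid"])["pid"].nunique()

# Rows = years, columns = affiliations; missing combinations become 0
trend = counts.unstack(fill_value=0)

# Affiliation(s) with the largest total count across all years
top = trend.sum().nlargest(1).index.tolist()

print(trend)
print("top affiliation:", top)
```

Each column of `trend` is one affiliation's attendee count over time, so selecting the `top` columns gives the series that could be drawn as one line per affiliation with pyplot.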
    


## Correlation between school types/sizes and number of attendees

In [5]:
print(len(data_school))

922


In [6]:
# each year 
from sklearn import linear_model
from scipy import stats
class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class modeled on sklearn's, which also calculates
    t-statistics and p-values for the model coefficients (betas).
    Additional attributes available after .fit() are `t` and `p`.
    For the 1-D y used here, both have shape (1, X.shape[1]),
    i.e. one value per coefficient.
    This class sets fit_intercept to False by default, since usually we
    include the intercept as a column of X.
    """

    def __init__(self, fit_intercept=False, **kwargs):
        # An explicit fit_intercept argument (instead of *args) keeps the
        # constructor compatible with sklearn's parameter inspection
        super(LinearRegression, self).__init__(fit_intercept=fit_intercept, **kwargs)

    def fit(self, X, y):
        # sklearn's fit() takes (X, y, sample_weight); n_jobs belongs to the constructor
        super(LinearRegression, self).fit(X, y)

        # Residual variance: sum of squared errors over the degrees of freedom (n - k)
        dof = X.shape[0] - X.shape[1]
        mse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(dof)
        # Standard error of each coefficient
        se = np.array([np.sqrt(np.diagonal(mse * np.linalg.inv(np.dot(X.T, X))))])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), dof))
        return self

# For each year 
for yr in range(2006, 2015):
    selected_school = data_school.loc[(data_school['ipeds_aid'].isin(attendees_by_year[yr]['ipeds_aid']))]
    attendee_per_school = pd.merge(selected_school, attendees_by_year[yr], on='ipeds_aid')

#    attendee_per_school.head(5)
    reg = LinearRegression()
    X = (attendee_per_school[['GROFFER','CCSIZSET','INSTSIZE','CBSATYPE']]).values
    Y = (attendee_per_school['pid']).values
    reg.fit(X,Y)
    print('Year', yr)
    print('Coefficients: \n', reg.coef_)
    print('P value:\n', reg.p)
    #print('T value:\n', reg.t)


Year 2006
Coefficients: 
 [-4.58057063  0.09410764  3.67850809 -2.32049815]
P value:
 [[  2.08432299e-04   4.22956538e-01   0.00000000e+00   8.27720353e-03]]
Year 2007
Coefficients: 
 [-4.67011759  0.14222568  3.65833823 -2.05064062]
P value:
 [[  1.33313468e-04   2.65111173e-01   0.00000000e+00   1.88771136e-02]]
Year 2008
Coefficients: 
 [-3.74103525  0.04831408  3.86492662 -2.62325267]
P value:
 [[ 0.00097235  0.67236023  0.          0.00345529]]
Year 2009
Coefficients: 
 [-4.50983243  0.06434592  3.91027156 -2.09771174]
P value:
 [[  1.32412459e-04   6.05902973e-01   0.00000000e+00   1.59609362e-02]]
Year 2010
Coefficients: 
 [-4.64864721  0.09302469  3.92050257 -2.22203631]
P value:
 [[  2.54228081e-04   4.55146434e-01   0.00000000e+00   2.47491045e-02]]
Year 2011
Coefficients: 
 [-4.3743      0.05756542  3.46904807 -1.26588787]
P value:
 [[  2.83490718e-05   5.85995995e-01   0.00000000e+00   8.21552903e-02]]
Year 2012
Coefficients: 
 [-4.2549058   0.0510099   3.83771044 -2.163562
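
The p-values printed above come from the t-statistic of each coefficient: the coefficient divided by its standard error, where the standard errors are the square roots of the diagonal of the residual variance times the inverse of XᵀX. The sketch below reproduces that calculation with NumPy and SciPy only, on synthetic data. The function name `ols_with_pvalues` and the generated `X` and `y` are assumptions for illustration. It uses `stats.t.sf` (the survival function) rather than `1 - cdf`, since `1 - cdf` rounds to exactly 0 once the p-value drops below floating-point precision, which is one likely reason some p-values above print as `0.00000000e+00`.

```python
import numpy as np
from scipy import stats


def ols_with_pvalues(X, y):
    """Least-squares fit without intercept; returns (beta, t, p). Hypothetical helper."""
    n, k = X.shape
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = n - k                                # degrees of freedom
    sigma2 = resid @ resid / dof               # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the coefficients
    se = np.sqrt(np.diag(cov))                 # standard errors
    t = beta / se
    p = 2 * stats.t.sf(np.abs(t), dof)         # two-sided p-values
    return beta, t, p


# Synthetic data: y depends on the first column only (true coefficient 3.0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

beta, t, p = ols_with_pvalues(X, y)
print("coefficients:", beta)
print("p-values:", p)
```

As in the `LinearRegression` subclass, no intercept is fitted here; an intercept can be included by adding a column of ones to `X`.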

In [10]:
# Combine all the years
attendees_aggr = data.groupby("ipeds_aid").agg({"pid": pd.Series.nunique}).reset_index()
selected_school = data_school.loc[(data_school['ipeds_aid'].isin(attendees_aggr['ipeds_aid']))]
attendee_per_school = pd.merge(selected_school, attendees_aggr, on='ipeds_aid')

reg = LinearRegression()
X = (attendee_per_school[['GROFFER','CCSIZSET','INSTSIZE','CBSATYPE']]).values
Y = (attendee_per_school['pid']).values
reg.fit(X,Y)
print('All years combined')
print('Coefficients: \n', reg.coef_)
print('P value:\n', reg.p)
#print('T value:\n', reg.t)

All years combined
Coefficients: 
 [-24.77247609   0.10817666  23.03223748 -11.17416757]
P value:
 [[  4.48128645e-10   8.12213399e-01   0.00000000e+00   6.51971223e-04]]
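
The regressions above use the integer codes of GROFFER, CCSIZSET, INSTSIZE, and CBSATYPE directly as numeric predictors. Because the schema describes these attributes as categorical, one possible refinement is to one-hot encode them first, so that each category gets its own coefficient. The sketch below shows only the encoding step on an invented toy frame. The values are hypothetical, and `drop_first=True` is one common choice for avoiding perfectly collinear columns, not something the analysis above uses.

```python
import pandas as pd

# Toy frame with two categorical code columns (hypothetical values)
toy = pd.DataFrame({
    "INSTSIZE": [3, 4, 5, 3],
    "CBSATYPE": [1, 1, 2, 1],
    "pid":      [10, 25, 40, 12],
})

# One indicator column per category; the first level of each attribute
# is dropped and serves as the baseline
encoded = pd.get_dummies(
    toy, columns=["INSTSIZE", "CBSATYPE"], drop_first=True, dtype=int
)

print(encoded)
```

The indicator columns of `encoded` could then replace the raw code columns in `X`, with each coefficient read as the difference from the baseline category.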
