## 6. Classification
Use logistic regression with a score above/below the mean to determine which attributes are most relevant for researchers versus non-researchers.

### 4.1 $\chi^2$-test <a id='chi_test'></a>
A chi-squared test, also written as χ² test, is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. Without other qualification, 'chi-squared test' is often used as shorthand for Pearson's chi-squared test. The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

In [None]:
from scipy.stats import chisquare
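
As a minimal sketch of the test described above, the cell below applies `scipy.stats.chisquare` to made-up counts. The observed and expected frequencies are illustrative assumptions, not values taken from the dataset.

```python
from scipy.stats import chisquare

# Hypothetical observed counts in four categories (not taken from the dataset)
observed = [18, 22, 20, 40]
# Expected counts under the null hypothesis of a uniform distribution
expected = [25, 25, 25, 25]

# chisquare requires observed and expected counts to have the same total
result = chisquare(f_obs=observed, f_exp=expected)
statistic, p_value = result.statistic, result.pvalue
print(f"chi2 = {statistic:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests that the observed counts differ from the expected ones by more than chance alone would explain.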

### 4.2 Clustering <a id='clustering'></a>


In [None]:
data_temp = data.copy()
data_temp = data_temp.dropna(subset=['score'])
# Labels each professor's score as 'good' (top quartile), 'regular' (middle half) or 'bad' (bottom quartile)
cond1 = (data_temp.score >= data_temp.score.quantile(0.75)).replace([True,False],['good',''])
cond2 = (np.logical_and(data_temp.score < data_temp.score.quantile(0.75),data_temp.score > data_temp.score.quantile(0.25))).replace([True,False],['regular',''])
cond3 = (data_temp.score <= data_temp.score.quantile(0.25)).replace([True,False],['bad',''])
data_temp['score_category'] = (cond1+cond2+cond3)

In [None]:
# Copies the Monterrey subset so that adding a column does not write to a view of `data`
data_mty = data[data.Campus == 'Campus Monterrey'].copy()
data_mty = data_mty.dropna(subset=['score'])
cond1 = (data_mty.score >= data_mty.score.quantile(0.75)).replace([True,False],['good',''])
cond2 = (np.logical_and(data_mty.score < data_mty.score.quantile(0.75),data_mty.score > data_mty.score.quantile(0.25))).replace([True,False],['regular',''])
cond3 = (data_mty.score <= data_mty.score.quantile(0.25)).replace([True,False],['bad',''])
data_mty['score_category'] = (cond1+cond2+cond3)

In [None]:
# Labels each undergraduate professor's score as 'good', 'regular' or 'bad' by quartile
cond1 = (df_under.score >= df_under.score.quantile(0.75)).replace([True,False],['good',''])
cond2 = (np.logical_and(df_under.score < df_under.score.quantile(0.75),df_under.score > df_under.score.quantile(0.25))).replace([True,False],['regular',''])
cond3 = (df_under.score <= df_under.score.quantile(0.25)).replace([True,False],['bad',''])
df_under['score_category'] = (cond1+cond2+cond3)

In [None]:
# Labels each graduate professor's score as 'good', 'regular' or 'bad' by quartile
cond1 = (df_grad.score >= df_grad.score.quantile(0.75)).replace([True,False],['good',''])
cond2 = (np.logical_and(df_grad.score < df_grad.score.quantile(0.75),df_grad.score > df_grad.score.quantile(0.25))).replace([True,False],['regular',''])
cond3 = (df_grad.score <= df_grad.score.quantile(0.25)).replace([True,False],['bad',''])
df_grad['score_category'] = (cond1+cond2+cond3)
df_grad = df_grad[df_grad.score_category != '']
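
The quartile labeling above is repeated for each dataframe, so it could be factored into one function. The sketch below is one possible way to do it; `add_score_category` is a hypothetical helper, not part of the notebook, and it uses `np.select` instead of string concatenation. Unlike the cells above, it always drops rows with a missing score, and a score that meets both conditions (only possible when the two quartiles coincide) is labeled 'good'.

```python
import numpy as np
import pandas as pd

def add_score_category(df, col="score"):
    """Return a copy of df with a quartile-based 'score_category' column.

    Hypothetical helper: rows with a missing score are dropped first.
    """
    out = df.dropna(subset=[col]).copy()
    q25, q75 = out[col].quantile([0.25, 0.75])
    out["score_category"] = np.select(
        [out[col] >= q75, out[col] <= q25],  # conditions are checked in order
        ["good", "bad"],
        default="regular",
    )
    return out

# Toy data for illustration only
toy_scores = pd.DataFrame({"score": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan]})
labeled = add_score_category(toy_scores)
print(labeled)
```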

### 4.3 Logistic regression <a id='logistic_regression'></a>

1. **<span style="color:green">✓</span>** Binary logistic regression requires the dependent variable to be binary. 
2. **<span style="color:green">✓</span>** For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
3. Only the meaningful variables should be included.
4. The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
5. The independent variables are linearly related to the log odds.
6. Logistic regression requires quite large sample sizes.

In [None]:
#from sklearn.linear_model import LogisticRegression
#https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
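
A minimal sketch of the binary setup described at the start of this section: the target is 1 when the score is above the mean and 0 otherwise. The data here is synthetic and the effect sizes are invented. Only the column names `sni_yn`, `experience` and `score` come from the notebook, and `sni_yn` is assumed to be coded 0/1.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the real data (assumption for illustration)
toy = pd.DataFrame({
    "sni_yn": rng.integers(0, 2, n),        # 1 = researcher, 0 = non-researcher
    "experience": rng.integers(0, 30, n),   # years of experience
})
toy["score"] = (8 + 0.5 * toy["sni_yn"] + 0.02 * toy["experience"]
                + rng.normal(0, 0.5, n))

# Binary dependent variable: level 1 is the desired outcome (score above the mean)
y_toy = (toy["score"] > toy["score"].mean()).astype(int)
X_toy = toy[["sni_yn", "experience"]]

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Coefficients are changes in the log odds per unit of each feature
coefs = pd.Series(model.coef_[0], index=X_toy.columns)
print(coefs)
```

Because the coefficients are on the log-odds scale, features measured in different units are not directly comparable unless they are standardized first.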

### 4.4 Decision Tree <a id='decision_tree'></a>
Pros
- Decision trees are easy to interpret and visualize.
- They can easily capture nonlinear patterns.
- They require little data preprocessing; for example, there is no need to normalize columns.
- They can be used for feature engineering, such as predicting missing values, and are suitable for variable selection.
- They make no assumptions about the distribution of the data, because the algorithm is non-parametric.

Cons
- They are sensitive to noisy data and can overfit it.
- Small variations in the data can result in a very different tree. This variance can be reduced with bagging and boosting algorithms.
- They are biased when the dataset is imbalanced, so it is recommended to balance the dataset before fitting the tree.
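
Two of the drawbacks listed above can be mitigated directly through `DecisionTreeClassifier` parameters. The sketch below uses synthetic, noisy, imbalanced data (an assumption for illustration) to compare an unrestricted tree with one that limits `max_depth` and sets `class_weight="balanced"`. The specific values are arbitrary, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 80/20 class imbalance and 10% randomly flipped labels
X_syn, y_syn = make_classification(
    n_samples=600, n_features=6, n_informative=3,
    weights=[0.8, 0.2], flip_y=0.1, random_state=0,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42, stratify=y_syn
)

# An unrestricted tree grows until it fits the training data, noise included
deep = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
# A shallow tree with class re-weighting is less prone to both problems
shallow = DecisionTreeClassifier(
    max_depth=3, class_weight="balanced", random_state=0
).fit(Xs_train, ys_train)

for name, m in [("unrestricted", deep), ("max_depth=3, balanced", shallow)]:
    print(name,
          "| train:", round(m.score(Xs_train, ys_train), 3),
          "| test:", round(m.score(Xs_test, ys_test), 3))
```

A large gap between training and test accuracy for the unrestricted tree is the usual symptom of overfitting.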


In [None]:
#https://www.datacamp.com/community/tutorials/decision-tree-classification-python
#https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
from sklearn import tree
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import metrics

#### Graduate <a id='decision_tree_grad'></a>

In [None]:
df_grad_temp = df_grad[['score_category','sni_yn','under_yn','experience','grado']].dropna()
df_grad_temp_X = df_grad_temp.drop('score_category',axis=1)
df_grad_temp_y = df_grad_temp.score_category

# Encodes categorical data
enc = preprocessing.OrdinalEncoder()
X = df_grad_temp_X.values
X_encoded = enc.fit_transform(X)

# Encodes the class labels; LabelEncoder sorts the classes alphabetically
le = preprocessing.LabelEncoder()
y_encoded = le.fit_transform(df_grad_temp_y)

#Splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.33, random_state=42)

In [None]:
# Trains the tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [None]:
# Evaluates the tree on the test set
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
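
Because the quartile labels split the data roughly 25/50/25, a model that always predicts 'regular' already reaches about 50% accuracy. The sketch below shows one way to put the accuracy in context with a majority-class baseline and a confusion matrix. It runs on random synthetic data, so the features carry no signal and the numbers themselves are meaningless; the variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the encoded features and labels
# (0 = bad, 1 = good, 2 = regular, as LabelEncoder would order them)
X_demo = rng.integers(0, 4, size=(300, 4))
y_demo = rng.choice([0, 1, 2], size=300, p=[0.25, 0.25, 0.5])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(Xd_train, yd_train)
clf_demo = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)

print("Baseline accuracy:", accuracy_score(yd_test, baseline.predict(Xd_test)))
print("Tree accuracy:", accuracy_score(yd_test, clf_demo.predict(Xd_test)))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yd_test, clf_demo.predict(Xd_test), labels=[0, 1, 2])
print(cm)
```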

In [None]:
#Draws the decision tree
import graphviz 
dot_data = tree.export_graphviz(clf,
                                out_file=None,
                                feature_names = list(df_grad_temp_X.columns),
                                # class_names must follow the encoded order (le.classes_), not order of appearance
                                class_names = list(le.classes_),
                                filled=True,
                                rounded=True,
                                special_characters=True)  
graph = graphviz.Source(dot_data)  
#graph.render('tree_grad')
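
If Graphviz is not installed, `sklearn.tree.export_text` gives a plain-text view of the same kind of tree. The sketch below uses a tiny made-up dataset in which the label simply copies the first feature; the feature names are borrowed from the notebook for readability only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: the label equals the first column
X_tiny = [[0, 1], [1, 5], [0, 3], [1, 2]]
y_tiny = [0, 1, 0, 1]

tiny_tree = DecisionTreeClassifier(random_state=0).fit(X_tiny, y_tiny)

# Text rendering of the fitted rules, one line per node
report = export_text(tiny_tree, feature_names=["sni_yn", "experience"])
print(report)
```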

#### Undergraduate <a id='decision_tree_under'></a>

In [None]:
df_under_temp = df_under[['score_category','sni_yn','under_yn','experience','grado']].dropna()
df_under_temp_X = df_under_temp.drop('score_category',axis=1)
df_under_temp_y = df_under_temp.score_category

# Encodes categorical data
enc = preprocessing.OrdinalEncoder()
X = df_under_temp_X.values
X_encoded = enc.fit_transform(X)

# Encodes the class labels; LabelEncoder sorts the classes alphabetically
le = preprocessing.LabelEncoder()
y_encoded = le.fit_transform(df_under_temp_y)

#Splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.33, random_state=42)

In [None]:
# Trains the tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [None]:
# Evaluates the tree on the test set
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

#### Total

In [None]:
data_temp2 = data_temp[['score_category','sni_yn','under_yn','experience','grado']].dropna()
data_temp2_X = data_temp2.drop('score_category',axis=1)
data_temp2_y = data_temp2.score_category

# Encodes categorical data
enc = preprocessing.OrdinalEncoder()
X = data_temp2_X.values
X_encoded = enc.fit_transform(X)

# Encodes the class labels; LabelEncoder sorts the classes alphabetically
le = preprocessing.LabelEncoder()
y_encoded = le.fit_transform(data_temp2_y)

#Splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.33, random_state=42)

In [None]:
# Trains the tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [None]:
# Evaluates the tree on the test set
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

#### Campus Monterrey

In [None]:
df_mty_temp = data_mty[['score_category','sni_yn','under_yn','experience','grado']].dropna()
df_mty_temp_X = df_mty_temp.drop('score_category',axis=1)
df_mty_temp_y = df_mty_temp.score_category

# Encodes categorical data
enc = preprocessing.OrdinalEncoder()
X = df_mty_temp_X.values
X_encoded = enc.fit_transform(X)

# Encodes the class labels; LabelEncoder sorts the classes alphabetically
le = preprocessing.LabelEncoder()
y_encoded = le.fit_transform(df_mty_temp_y)

#Splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.33, random_state=42)

In [None]:
# Trains the tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [None]:
# Evaluates the tree on the test set
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
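
The encode, split, train and evaluate steps are identical for the four subsets above, so they could be wrapped in a single function. The sketch below is one possible refactoring; `fit_score_tree` is a hypothetical helper, and the toy frame (including the 'yes'/'no' coding of `sni_yn`) is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import metrics, preprocessing, tree
from sklearn.model_selection import train_test_split

def fit_score_tree(df, features, target="score_category"):
    """Encode, split, fit and score a decision tree (hypothetical helper)."""
    subset = df[[target] + features].dropna()
    # Ordinal-encode the features and label-encode the target
    X_enc = preprocessing.OrdinalEncoder().fit_transform(subset[features])
    encoder = preprocessing.LabelEncoder()
    y_enc = encoder.fit_transform(subset[target])
    # Same split settings as the cells above
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_enc, y_enc, test_size=0.33, random_state=42
    )
    fitted = tree.DecisionTreeClassifier().fit(X_tr, y_tr)
    accuracy = metrics.accuracy_score(y_te, fitted.predict(X_te))
    return fitted, encoder, accuracy

# Made-up frame that reuses the notebook's column names
rng = np.random.default_rng(1)
toy_frame = pd.DataFrame({
    "score_category": rng.choice(["good", "regular", "bad"], size=120),
    "sni_yn": rng.choice(["yes", "no"], size=120),
    "experience": rng.integers(0, 30, size=120),
})
fitted, encoder, accuracy = fit_score_tree(toy_frame, ["sni_yn", "experience"])
print("Accuracy:", accuracy)
```

With such a helper, each subset would need a single call, for example `fit_score_tree(df_grad, ['sni_yn', 'under_yn', 'experience', 'grado'])`, and the returned encoder keeps the class order available for plotting.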