# Logistic Regression

## Why Not Just Use A Linear Regression?

### Assumptions for Linear Models:
- Gaussian distribution of residuals (errors)
- Y (target variable) is continuous on the prediction interval
![Binary target](images/binary.png)
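
To make the second assumption concrete, here is a minimal sketch that fits ordinary least squares to a 0/1 target. The toy data and variable names are illustrative assumptions, not part of the review dataset. The fitted line returns values below 0 and above 1, which cannot be read as probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): one feature, binary target that flips from 0 to 1
x = np.arange(10, dtype=float).reshape(-1, 1)
y_binary = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

ols = LinearRegression().fit(x, y_binary)

# Predict well below, in the middle of, and well above the training range
x_new = np.array([[-5.0], [4.5], [15.0]])
preds = ols.predict(x_new)
print(preds)  # first value is negative, last value exceeds 1
```
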

### Finding A Decision Boundary
![Finding a decision boundary](images/lr1.png)

### Log of Equal Odds
![Log of equal odds](images/lr2.png)

### Logit Link Function
![Logit link function](images/lr3.png)
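
As a rough numerical companion to the figure, the sketch below assumes the standard definitions: the logit maps a probability `p` to the log-odds `log(p / (1 - p))`, and the sigmoid is its inverse. The function name `logit` and the sample probabilities are illustrative choices. Note that `p = 0.5` (equal odds) maps to a log-odds of 0, which is where the decision boundary sits.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Map a probability in (0, 1) to log-odds."""
    return np.log(p / (1 - p))

p = np.array([0.1, 0.5, 0.9])
log_odds = logit(p)            # negative, zero, positive
recovered = sigmoid(log_odds)  # applying the inverse returns the probabilities

print(log_odds)
print(recovered)
```
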

### Solving for Each Class (Binary Target)
![Solving for each class](images/lr4.png)

### Log Likelihood
![Log likelihood](images/lr5.png)
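
The following is a minimal sketch of the Bernoulli log likelihood for a binary target, assuming the usual form `sum(y * log(p) + (1 - y) * log(1 - p))`. The helper name `log_likelihood`, the `eps` clipping, and the example probabilities are assumptions made for illustration. Predictions that agree with the labels score higher (closer to 0) than an uninformed guess of 0.5 for every sample.

```python
import numpy as np

def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log likelihood of labels y_true given predicted P(y=1)."""
    # Clip to avoid log(0) when a prediction is exactly 0 or 1
    p = np.clip(p_pred, eps, 1 - eps)
    return np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
confident = np.array([0.9, 0.1, 0.8, 0.7])  # mostly agrees with the labels
uninformed = np.full(4, 0.5)                # always predicts 50/50

ll_confident = log_likelihood(y_true, confident)
ll_uninformed = log_likelihood(y_true, uninformed)
print(ll_confident, ll_uninformed)
```

Fitting a logistic regression amounts to choosing coefficients that maximize this quantity. scikit-learn's `log_loss` reports its negative, averaged over samples.
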

In [None]:
import pandas as pd
import numpy as np

def sigmoid(z):
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

In [None]:
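# Read one review per line; the context managers close the files afterwards
with open("poor_amazon_toy_reviews.txt") as f:
    poor = f.readlines()
with open("good_amazon_toy_reviews.txt") as f:
    good = f.readlines()

# Label good reviews 1 (positive) and poor reviews 0 (negative)
good_reviews = [(review, 1) for review in good]
poor_reviews = [(review, 0) for review in poor]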

all_reviews = good_reviews + poor_reviews
all_reviews_df = pd.DataFrame(all_reviews, columns=["review", "positive"])
all_reviews_df.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words="english",
                             max_features=1000,
                             token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b")

In [None]:
X = vectorizer.fit_transform(all_reviews_df["review"])
y = all_reviews_df["positive"].values
X

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X, y)

In [None]:
y_pred = lr.predict(X)

# calculate accuracy
np.mean(y_pred == y)

from sklearn.metrics import confusion_matrix

confusion_matrix(y, y_pred)

## AUROC (Area Under the Receiver Operating Characteristic Curve)

![alt text](images/auroc.png "AUROC")


> *"The probability a randomly-chosen positive example is ranked more highly than a randomly chosen negative example", which then can be further interpreted as "**the probability that two randomly-selected samples are correctly ranked**"* [Understanding AUC Pros and Cons](https://medium.com/@penggongting/understanding-roc-auc-pros-and-cons-why-is-bier-score-a-great-supplement-c7a0c976b679)
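
The ranking interpretation can be checked directly. The sketch below uses a hypothetical helper `pairwise_auc` and toy scores, both of which are assumptions. It counts the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half, and compares the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc_pairs = pairwise_auc(y_true, scores)      # 3 of 4 pairs ranked correctly
auc_sklearn = roc_auc_score(y_true, scores)
print(auc_pairs, auc_sklearn)
```
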

In [None]:
from sklearn.metrics import roc_auc_score
# AUROC scores a ranking, so pass positive-class probabilities, not 0/1 labels
roc_auc_score(y, lr.predict_proba(X)[:, 1])

In [None]:
data = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
data["TARGET"] = y

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data)
X_train = train_df.drop(columns=["TARGET"])
X_test = test_df.drop(columns=["TARGET"])

y_train = train_df["TARGET"]
y_test = test_df["TARGET"]

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
lr.fit(X_train, y_train)

In [None]:
y_pred = lr.predict(X_test)

np.mean(y_pred == y_test)

### Why AUROC?

* Easier to visualize the tradeoff between sensitivity and specificity.
* Makes the impact of different cut-offs on model performance visible.
* Less sensitive to class imbalance:

> *In the case of an imbalanced dataset, the stepsize is different. So, you make smaller steps to the left (if you have more negative samples). That is why the score is more or less independent of the imbalance*. [Cross Validated: Advantages of ROC Curves](https://stats.stackexchange.com/questions/28745/advantages-of-roc-curves)
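
To see how cut-offs trade sensitivity against specificity, the sketch below runs `roc_curve` on a small, deliberately imbalanced toy sample. The labels and scores are made up for illustration. Each row corresponds to one candidate cut-off: lowering it raises sensitivity (true positive rate) at the cost of specificity (1 minus the false positive rate).

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy, imbalanced sample (hypothetical): four negatives, two positives
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# The first threshold is a sentinel above every score (nothing predicted positive)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  sensitivity={t:.2f}  specificity={1 - f:.2f}")
```
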

## Cross Validation
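* Statistical technique for evaluating the performance of a machine learning model on data it was not trained on.
* Mitigates the effect of **selection bias** that comes from relying on a single train/test split.
* Allows us to use the entire dataset: every observation is used for both training and validation, in different folds.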

Traditionally, we divide up our dataset into train, test, and validation:
![test_train](images/test_train.png)

With cross validation:
![kfolds](images/kfolds.png)

[Why and How to Do Cross Validation for Machine Learning](https://towardsdatascience.com/why-and-how-to-do-cross-validation-for-machine-learning-d5bd7e60c189)
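
What k-fold cross validation does under the hood can be sketched with an explicit loop. The example uses synthetic data from `make_classification` rather than the review dataset, and the names `X_demo`, `y_demo`, and `val_counts` are assumptions for illustration. It also confirms that each observation lands in the validation fold exactly once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (not the review dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
val_counts = np.zeros(len(y_demo), dtype=int)

for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Fit a fresh model on the training folds, score it on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))
    val_counts[val_idx] += 1

print(np.round(fold_scores, 3), np.mean(fold_scores))
```

For a classifier and an integer `cv`, `cross_validate` uses stratified folds as well, so this loop is a close (shuffled) analogue of the call in the next cell.
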

In [None]:
from sklearn.model_selection import cross_validate
X = data.loc[:, ~data.columns.isin(['TARGET'])]
cv_results = cross_validate(lr, X, y, cv=10, return_train_score=False)

In [None]:
cv_results['test_score']

# Business Use Cases of Sentiment Analysis

* **Governments**: monitor social reactions to policy decisions and politicians' overall reputations. For instance, social media commentary and sentiment played a key role in the [Arab Spring](https://en.wikipedia.org/wiki/Social_media_and_the_Arab_Spring).
* **Operations**: customer feedback on different stages of the customer lifecycle/experience can detect
* **Product Management**:
* **Digital Marketing**: A/B test different trailers and dynamically optimize budget allocation on Facebook and Twitter, spending more on the trailer version that garners the greatest ratio of positive to negative sentiment.
* **Human Resources**: Written performance reviews by human managers often tend to be skewed (in either direction). Using a sentiment analysis model to benchmark "average performance management sentiment" can help calibrate performance reviews so that employees who are reviewed by extremely strict managers are not unduly penalized or denied bonuses.