# NYTimes Bias in Recommendations?

The key question is whether readers recommend comments differently depending on whether the commenter's name "sounds" male or female.

IMPORTANT: To make this less time-consuming, the code below keeps only the articles and comments from the period 3/1-3/15.

In [1]:
df <- read.csv("gendered_nytimes_comments.csv", stringsAsFactors=FALSE)

In [None]:
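# Characters 9-10 of pub_date (e.g. "2020-03-10T15:44:21+0000") are the day of the month.
# Only the day is checked, not the month, so this relies on the file covering a single month.
target <- as.numeric(substr(df$pub_date, 9, 10)) <= 15
df <- df[target, ]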

In [8]:
dim(df)

In [9]:
head(df, 3)

X.1,display_name,X,uri,num_rec,update_date,approve_date,editorsSelection,word_count,uniq_word_count,tot_comms,rank,time_gap,lastname,firstname,pub_date,news_desk,label
1,__,2704386,nyt://article/93d985f5-e72e-5ff8-b1e5-d7668f31b4fb,4,1583961033,1583883896,False,55,48,75,47,27365,Austen,Ian,2020-03-10T15:44:21+0000,Foreign,unknown
2,__,2704387,nyt://article/93d985f5-e72e-5ff8-b1e5-d7668f31b4fb,4,1583961033,1583883896,False,55,48,75,53,27365,Austen,Ian,2020-03-10T15:44:21+0000,Foreign,unknown
3,__,2704388,nyt://article/93d985f5-e72e-5ff8-b1e5-d7668f31b4fb,4,1583961033,1583883896,False,55,48,75,3,27365,Austen,Ian,2020-03-10T15:44:21+0000,Foreign,unknown


#### Just try it!
- Try regressing the number of recommendations on the gender label, the total number of comments (a proxy for readership), the time order of the comment, and whether it was endorsed by NYTimes.
- Make sure to convert the NA values of the gender label to a third gender label.
- Does gender matter? What matters the most?
- Remember to diagnose your model. Given the data size, you may want to **down sample** the data before plotting.
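As a starting point, here is a minimal sketch of how such a regression could be set up. It runs on a small simulated data frame `toy` rather than on `df`, so the numbers it prints mean nothing. The column names follow the `head(df)` output, but several things are assumptions: that `rank` is the time order of the comment, that `editorsSelection` is the NYTimes endorsement, that the gender labels are `"female"` and `"male"`, and that `"unknown"` is a suitable name for the third label.

```r
# Toy stand-in for df: simulated values, NOT the NYTimes data.
set.seed(1)
n <- 500
toy <- data.frame(
  label = sample(c("female", "male", NA), n, replace = TRUE),  # assumed label values
  tot_comms = sample(20:2000, n, replace = TRUE),
  editorsSelection = sample(c("False", "True"), n, replace = TRUE,
                            prob = c(0.95, 0.05)),
  stringsAsFactors = FALSE
)
# One rank per comment, between 1 and the article's comment count (assumed meaning of rank)
toy$rank <- sapply(toy$tot_comms, function(m) sample.int(m, 1))
toy$num_rec <- rpois(n, lambda = 5)

# Recode missing labels as a third category; otherwise lm() silently drops those rows.
toy$label[is.na(toy$label)] <- "unknown"
toy$label <- factor(toy$label)

# OLS of recommendations on gender label, readership proxy, time order, and endorsement
fit <- lm(num_rec ~ label + tot_comms + rank + editorsSelection, data = toy)
summary(fit)
```

On the real data, the same formula would be fitted with `data = df` after the same recoding of `label`.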

Which of the following OLS assumptions are violated?
- linearity $Y = X\beta + \epsilon$
- $E(\epsilon|X) = 0$
- $\epsilon_i$ are independent from one another
- $Var(\epsilon|X) = \sigma^2$, a constant that does not depend on $X$ (homoscedasticity)
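One common way to check these assumptions is a residuals-versus-fitted plot drawn on a random subset of rows. The sketch below is illustrative only: the data are simulated (skewed counts whose spread grows with the mean), and the sample size of 1000 is an arbitrary choice.

```r
# Simulated data for illustration only, NOT the NYTimes data.
set.seed(2)
n <- 5000
toy <- data.frame(tot_comms = sample(20:2000, n, replace = TRUE))
# Over-dispersed counts: the variance increases with the mean
toy$num_rec <- rnbinom(n, mu = 2 + toy$tot_comms / 100, size = 0.5)

fit <- lm(num_rec ~ tot_comms, data = toy)

# Fit on all rows, but plot only a random subset to keep the figure readable
idx <- sample(nrow(toy), 1000)
plot(fitted(fit)[idx], resid(fit)[idx],
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted (down-sampled)")
abline(h = 0, lty = 2)
```

A curved band of residuals points to a problem with linearity or $E(\epsilon|X) = 0$, while a funnel shape that widens with the fitted values points to non-constant variance.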

#### Let's transform the data a bit
- It is common to apply the log(X + 1) transformation to count data. Plot the histograms to see the difference before and after the transformation. (Don't forget to down sample when plotting!)
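The transformation and the before/after comparison might look like the following sketch. The counts are simulated and the subset size is arbitrary; on the real data the vector would be `df$num_rec` or another count column.

```r
# Simulated skewed counts for illustration only, NOT the NYTimes data.
set.seed(3)
num_rec <- rnbinom(20000, mu = 8, size = 0.3)

# log1p(x) computes log(x + 1), which is defined at x = 0
log_num_rec <- log1p(num_rec)

# Down sample before plotting
idx <- sample(length(num_rec), 2000)

par(mfrow = c(1, 2))  # two panels side by side
hist(num_rec[idx], breaks = 50, main = "Raw counts", xlab = "num_rec")
hist(log_num_rec[idx], breaks = 50, main = "After log(x + 1)",
     xlab = "log(num_rec + 1)")
par(mfrow = c(1, 1))  # restore the single-panel layout
```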

#### Refit the OLS using the new variables

Do not forget to diagnose the data once again!

### Sanity check your OLS results, interpret the inferred coefficients, and articulate the findings

Do any of them seem funny to you? If so, how would you triage the issue? (Hint: visualize it.)
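When the outcome is on the log(Y + 1) scale, the coefficients are easier to read after being mapped back. The sketch below assumes the refit uses `log1p(num_rec)` as the outcome. The data are simulated and the label values are assumed, so only the mechanics carry over.

```r
# Simulated data for illustration only, NOT the NYTimes data.
set.seed(4)
n <- 2000
toy <- data.frame(
  label = factor(sample(c("female", "male", "unknown"), n, replace = TRUE)),
  tot_comms = sample(20:2000, n, replace = TRUE)
)
toy$num_rec <- rnbinom(n, mu = 2 + toy$tot_comms / 100, size = 0.5)

fit_log <- lm(log1p(num_rec) ~ label + log1p(tot_comms), data = toy)

# For a dummy coefficient b, moving from the baseline level to that level
# multiplies the fitted geometric mean of (num_rec + 1) by exp(b).
b <- coef(fit_log)[c("labelmale", "labelunknown")]
pct_change <- 100 * (exp(b) - 1)
round(pct_change, 1)

# Confidence intervals show whether an effect is distinguishable from zero
confint(fit_log)
```

The baseline here is `"female"` only because it comes first alphabetically among the factor levels. The coefficient on `log1p(tot_comms)` reads approximately as an elasticity: the percent change in recommendations (plus one) per one percent change in comments (plus one).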

If you are ahead of schedule, try the following changes and re-fit the OLS:
- The variance of recommendations across articles is quite large. Create a feature that holds the "average Y for the article" for each comment and add it to your features. Re-fit the OLS one more time!
- Some features, such as rank, have a very wide range, and most articles have no comments in the upper part of that range. How could you transform such a feature? What happens when only one or two articles have values in the large ranges?
- Try adding word count and unique word count to your model. Is it worth it?
- Food for thought: how do you know when to stop?
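The three changes above can be sketched together as follows. This is one possible approach, not the only one. The data are simulated, and dividing `rank` by `tot_comms` assumes that `rank` runs from 1 up to the article's comment count. A nested-model F test via `anova()` is used here as one way to judge whether extra features are worth adding.

```r
# Toy stand-in for df: simulated values, NOT the NYTimes data.
set.seed(5)
n <- 1000
art <- sample(1:20, n, replace = TRUE)   # article index of each comment
art_size <- sample(50:2000, 20)          # comment count per article
toy <- data.frame(
  uri = paste0("article_", art),
  tot_comms = art_size[art],
  word_count = sample(5:250, n, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$rank <- sapply(toy$tot_comms, function(m) sample.int(m, 1))
toy$uniq_word_count <- pmin(toy$word_count,
                            pmax(1, round(0.8 * toy$word_count + rnorm(n, sd = 3))))
toy$num_rec <- rnbinom(n, mu = 2 + toy$tot_comms / 100, size = 0.5)
toy$log_rec <- log1p(toy$num_rec)

# 1. Article-level average of the outcome, repeated on every comment of that article
toy$article_mean <- ave(toy$log_rec, toy$uri, FUN = mean)

# 2. Relative rank in (0, 1], comparable across articles of different sizes
toy$rel_rank <- toy$rank / toy$tot_comms

# 3. Nested comparison: do the two word-count features add explanatory power?
base <- lm(log_rec ~ article_mean + rel_rank, data = toy)
full <- update(base, . ~ . + log1p(word_count) + log1p(uniq_word_count))
anova(base, full)
c(base = summary(base)$adj.r.squared, full = summary(full)$adj.r.squared)
```

Note that `article_mean` includes each comment's own outcome, which inflates its apparent predictive power for articles with few comments. A small F statistic and a negligible gain in adjusted R-squared suggest the added features are not worth the extra complexity.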