# Question
What are the odds of seeing a recommendation that ideologically diverges from the video being currently watched?
# Hypothesis
YouTube keeps people in the same ideological bubble most of the time.
# Test

Retrieve data from database

In [2]:
library(RSQLite)
con <- dbConnect(drv=RSQLite::SQLite(), dbname="./youtube_recommendations.sqlite")
recommendation <- dbGetQuery(conn=con, statement="SELECT * FROM recommendation")
dbDisconnect(conn=con)
nrow(recommendation)

Remove rows where `seed_channel_id` or `recommended_channel_id` is NA/NULL

In [3]:
recommendation <- recommendation[complete.cases(recommendation[, c('seed_channel_id','recommended_channel_id')]), ]
nrow(recommendation)

Remove rows where political_leaning is NA/NULL (both for seed and recommended videos)

In [4]:
recommendation <- recommendation[complete.cases(recommendation[, c('seed_political_leaning', 'recommended_political_leaning')]), ]
nrow(recommendation)

Convert `char` columns to `factor`

In [5]:
recommendation[['seed_political_leaning']] <- as.factor(recommendation[['seed_political_leaning']])
recommendation[['recommended_political_leaning']] <- as.factor(recommendation[['recommended_political_leaning']])

In [16]:
table(recommendation$recommended_political_leaning)
table(recommendation$recommended_political_leaning)/nrow(recommendation)
table(recommendation$seed_political_leaning, recommendation$recommended_political_leaning)


 LEFT RIGHT 
12123 36859 


     LEFT     RIGHT 
0.2474991 0.7525009 

       
         LEFT RIGHT
  LEFT   9600  4587
  RIGHT  2523 32272

**The lazy hypothesis (every recommended video is right-wing) would already be 75% accurate on this dataset, so that is the baseline any model has to beat.**
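As a quick check of that figure, the baseline can be recomputed from the class counts printed above. This is only a minimal sketch; `leaning_counts`, `majority_class`, and `baseline_accuracy` are illustrative names, not variables from the notebook.

```r
# Counts of recommended videos per leaning, copied from the table above.
leaning_counts <- c(LEFT = 12123, RIGHT = 36859)

# Always predicting the majority class is correct whenever that class occurs.
majority_class <- names(which.max(leaning_counts))
baseline_accuracy <- max(leaning_counts) / sum(leaning_counts)

majority_class
round(baseline_accuracy, 3)
```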

Create train and test sets

In [9]:
set.seed(2)
train <- rep(TRUE, nrow(recommendation))
train[sample.int(length(train), 0.3*nrow(recommendation))] <- FALSE
test <- (!train)
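
The split above holds out roughly 30% of the rows as a test set. The sketch below reproduces the same idea on the row count alone, with the truncation of the fractional size made explicit through `floor()`. The names `n`, `n_test`, and `is_train` are illustrative, and the row count is taken from the totals of the tables above.

```r
n <- 48982                # rows remaining after the NA filters (12123 + 36859)
n_test <- floor(0.3 * n)  # 30% of the rows, rounded down

set.seed(2)
is_train <- rep(TRUE, n)
# sample.int() draws without replacement, so exactly n_test distinct rows are held out.
is_train[sample.int(n, n_test)] <- FALSE

c(train = sum(is_train), test = sum(!is_train))
```

The resulting sizes agree with the totals of the training and test confusion tables reported further down.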

In [11]:
contrasts(recommendation$recommended_political_leaning)

      RIGHT
LEFT      0
RIGHT     1


## recommended_political_leaning ~ seed_political_leaning

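I am running this logistic regression without the intercept, but the fitted probabilities are the same with or without it. The only thing that changes is the coding of `seed_political_leaning`: without the intercept, each coefficient is the log-odds of a RIGHT recommendation for one seed leaning.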
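
To illustrate that equivalence, the sketch below fits both codings on a small data frame rebuilt from the seed-by-recommended counts printed earlier. It is an assumption-laden stand-in for the notebook's own fit: it uses the full-dataset counts rather than the training rows, passes the counts through `weights` instead of one row per recommendation, and the names `counts`, `with_intercept`, and `without_intercept` are illustrative.

```r
# Seed x recommended counts for the full dataset, copied from the table above.
counts <- data.frame(
  seed        = factor(c("LEFT", "LEFT", "RIGHT", "RIGHT")),
  recommended = factor(c("LEFT", "RIGHT", "LEFT", "RIGHT")),
  n           = c(9600, 4587, 2523, 32272)
)

# Each row stands for n identical observations, passed as prior weights.
with_intercept    <- glm(recommended ~ seed,     data = counts, weights = n, family = binomial)
without_intercept <- glm(recommended ~ seed - 1, data = counts, weights = n, family = binomial)

# Both codings give the same fitted probability of a RIGHT recommendation.
same_fit <- all.equal(unname(fitted(with_intercept)),
                      unname(fitted(without_intercept)),
                      tolerance = 1e-6)
same_fit

# With the intercept: LEFT log-odds plus a RIGHT-vs-LEFT difference.
coef(with_intercept)
# Without it: one log-odds per seed leaning.
coef(without_intercept)
```

The RIGHT coefficient of the no-intercept model equals the intercept plus the slope of the other model, which is why the predictions do not change.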

In [22]:
glm.model <- glm(recommended_political_leaning~seed_political_leaning-1, data=recommendation[train,], family=binomial)
summary(glm.model)


Call:
glm(formula = recommended_political_leaning ~ seed_political_leaning - 
    1, family = binomial, data = recommendation[train, ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2951   0.3861   0.3861   0.3861   1.4964  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
seed_political_leaningLEFT  -0.72442    0.02140  -33.85   <2e-16 ***
seed_political_leaningRIGHT  2.55922    0.02482  103.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 47533  on 34288  degrees of freedom
Residual deviance: 25128  on 34286  degrees of freedom
AIC: 25132

Number of Fisher Scoring iterations: 5
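
Because the model has no intercept, each coefficient above is the log-odds of a RIGHT recommendation for one seed leaning, so the inverse logit turns it into a probability. The sketch below does that conversion with the printed estimates; `log_odds`, `p_right`, and `p_diverge` are illustrative names.

```r
# Coefficient estimates copied from the summary above
# (log-odds of a RIGHT recommendation, per seed leaning).
log_odds <- c(LEFT = -0.72442, RIGHT = 2.55922)

# plogis() is the inverse logit: exp(x) / (1 + exp(x)).
p_right <- plogis(log_odds)

# Probability that the recommendation diverges from the seed video's leaning.
p_diverge <- c(LEFT  = p_right[["LEFT"]],
               RIGHT = 1 - p_right[["RIGHT"]])
round(p_diverge, 3)
```

On the training rows this works out to roughly a one-in-three chance of a right-leaning recommendation after a left-leaning video, and well under one in ten of a left-leaning recommendation after a right-leaning video.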


**Training set**

In [23]:
glm.probabilities <- predict(glm.model, recommendation[train,], type="response")
recommendation[train, 'predicted'] <- ifelse(glm.probabilities < .5,
                                             levels(recommendation$recommended_political_leaning)[1],
                                             levels(recommendation$recommended_political_leaning)[2])
classification.table <- table(recommendation[train, 'recommended_political_leaning'], recommendation[train,'predicted'])
classification.table
round(classification.table/nrow(recommendation[train,]),3)
print('Accuracy:')
sum(diag(classification.table))/sum(classification.table)

       
         LEFT RIGHT
  LEFT   6690  1749
  RIGHT  3242 22607

       
         LEFT RIGHT
  LEFT  0.195 0.051
  RIGHT 0.095 0.659

[1] "Accuracy:"
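
The accuracy value itself is not shown in the output above, but it follows from the printed counts. A minimal sketch that rebuilds the training-set table by hand (`confusion` and `accuracy` are illustrative names):

```r
# Training-set confusion matrix copied from the output above
# (rows: actual leaning, columns: predicted leaning).
confusion <- matrix(c(6690,  1749,
                      3242, 22607),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("LEFT", "RIGHT"),
                                    predicted = c("LEFT", "RIGHT")))

# Accuracy is the share of rows on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 3)
```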


**Test set**

In [24]:
glm.probabilities <- predict(glm.model, recommendation[test,], type="response")
recommendation[test, 'predicted'] <- ifelse(glm.probabilities < .5,
                                            levels(recommendation$recommended_political_leaning)[1],
                                            levels(recommendation$recommended_political_leaning)[2])
classification.table <- table(recommendation[test, 'recommended_political_leaning'], recommendation[test,'predicted'])
classification.table
round(classification.table/nrow(recommendation[test,]),3)
print('Accuracy:')
sum(diag(classification.table))/sum(classification.table)

       
        LEFT RIGHT
  LEFT  2910   774
  RIGHT 1345  9665

       
         LEFT RIGHT
  LEFT  0.198 0.053
  RIGHT 0.092 0.658

[1] "Accuracy:"


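**We can reject the null hypothesis for the main question posed in this notebook: there is an unmistakable tendency to stay within the same ideological confines when YouTube recommends a video. Predicting that the recommendation shares the seed video's leaning is correct for about 85% of rows in both sets (29,297 of 34,288 in training; 12,575 of 14,694 in test), against 75% for the lazy hypothesis.**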
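
One way to put a number on that tendency is the odds ratio from the seed-by-recommended table printed earlier, together with a chi-squared test of independence. Neither step is part of the original notebook, which relies on the logistic regression; this is an illustrative cross-check, and `leaning`, `odds_ratio`, and `independence_test` are hypothetical names.

```r
# Seed (rows) x recommended (columns) counts for the full dataset,
# copied from the table above.
leaning <- matrix(c(9600,  4587,
                    2523, 32272),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(seed        = c("LEFT", "RIGHT"),
                                  recommended = c("LEFT", "RIGHT")))

# Odds of a RIGHT recommendation after a RIGHT seed, relative to a LEFT seed.
odds_ratio <- (leaning["RIGHT", "RIGHT"] / leaning["RIGHT", "LEFT"]) /
              (leaning["LEFT",  "RIGHT"] / leaning["LEFT",  "LEFT"])
round(odds_ratio, 1)

# Chi-squared test of independence between seed and recommended leaning.
independence_test <- chisq.test(leaning)
independence_test$p.value < 0.001
```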

Another interesting question: **is rank important in determining the ideological leaning of a video?**

Let's first test with just the first recommendation.

In [46]:
recommendation_1 <- recommendation[recommendation$rank == 1, ]
nrow(recommendation_1)

In [47]:
set.seed(2)
train_1 <- rep(TRUE, nrow(recommendation_1))
train_1[sample.int(length(train_1), 0.3*nrow(recommendation_1))] <- FALSE
test_1 <- (!train_1)

In [48]:
glm.model <- glm(recommended_political_leaning~seed_political_leaning-1, data=recommendation_1[train_1,], family=binomial)
summary(glm.model)


Call:
glm(formula = recommended_political_leaning ~ seed_political_leaning - 
    1, family = binomial, data = recommendation_1[train_1, ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5536   0.2797   0.2797   0.2797   1.5161  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
seed_political_leaningLEFT  -0.76820    0.06562  -11.71   <2e-16 ***
seed_political_leaningRIGHT  3.22135    0.10354   31.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4992.0  on 3601  degrees of freedom
Residual deviance: 2162.9  on 3599  degrees of freedom
AIC: 2166.9

Number of Fisher Scoring iterations: 6


In [49]:
glm.probabilities <- predict(glm.model, recommendation_1[train_1,], type="response")
recommendation_1[train_1, 'predicted'] <- ifelse(glm.probabilities < .5,
                                                 levels(recommendation_1$recommended_political_leaning)[1],
                                                 levels(recommendation_1$recommended_political_leaning)[2])
classification.table <- table(recommendation_1[train_1, 'recommended_political_leaning'], recommendation_1[train_1,'predicted'])
classification.table
round(classification.table/nrow(recommendation_1[train_1,]),3)
print('Accuracy:')
sum(diag(classification.table))/sum(classification.table)

       
        LEFT RIGHT
  LEFT   733    97
  RIGHT  340  2431

       
         LEFT RIGHT
  LEFT  0.204 0.027
  RIGHT 0.094 0.675

[1] "Accuracy:"


In [50]:
glm.probabilities <- predict(glm.model, recommendation_1[test_1,], type="response")
recommendation_1[test_1, 'predicted'] <- ifelse(glm.probabilities < .5,
                                                 levels(recommendation_1$recommended_political_leaning)[1],
                                                 levels(recommendation_1$recommended_political_leaning)[2])
classification.table <- table(recommendation_1[test_1, 'recommended_political_leaning'], recommendation_1[test_1,'predicted'])
classification.table
round(classification.table/nrow(recommendation_1[test_1,]),3)
print('Accuracy:')
sum(diag(classification.table))/sum(classification.table)

       
        LEFT RIGHT
  LEFT   319    49
  RIGHT  119  1055

       
         LEFT RIGHT
  LEFT  0.207 0.032
  RIGHT 0.077 0.684

[1] "Accuracy:"


Now with the first three recommendations.

In [45]:
recommendation_123 <- recommendation[recommendation$rank %in% 1:3, ]
nrow(recommendation_123)

In [51]:
set.seed(2)
train_123 <- rep(TRUE, nrow(recommendation_123))
train_123[sample.int(length(train_123), 0.3*nrow(recommendation_123))] <- FALSE
test_123 <- (!train_123)

In [52]:
glm.model <- glm(recommended_political_leaning~seed_political_leaning-1, data=recommendation_123[train_123,], family=binomial)
summary(glm.model)


Call:
glm(formula = recommended_political_leaning ~ seed_political_leaning - 
    1, family = binomial, data = recommendation_123[train_123, 
    ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3764   0.3499   0.3499   0.3499   1.5294  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
seed_political_leaningLEFT  -0.79765    0.03939  -20.25   <2e-16 ***
seed_political_leaningRIGHT  2.76246    0.04949   55.81   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 14305.2  on 10319  degrees of freedom
Residual deviance:  7023.4  on 10317  degrees of freedom
AIC: 7027.4

Number of Fisher Scoring iterations: 5


In [53]:
glm.probabilities <- predict(glm.model, recommendation_123[train_123,], type="response")
recommendation_123[train_123, 'predicted'] <- ifelse(glm.probabilities < .5,
                                                 levels(recommendation_123$recommended_political_leaning)[1],
                                                 levels(recommendation_123$recommended_political_leaning)[2])
classification.table <- table(recommendation_123[train_123, 'recommended_political_leaning'], recommendation_123[train_123,'predicted'])
classification.table
round(classification.table/nrow(recommendation_123[train_123,]),3)
print('Accuracy:')
sum(diag(classification.table))/sum(classification.table)

       
        LEFT RIGHT
  LEFT  2076   434
  RIGHT  935  6874

       
         LEFT RIGHT
  LEFT  0.201 0.042
  RIGHT 0.091 0.666

[1] "Accuracy:"


In [54]:
glm.probabilities <- predict(glm.model, recommendation_123[test_123,], type="response")
recommendation_123[test_123, 'predicted'] <- ifelse(glm.probabilities < .5,
                                                 levels(recommendation_123$recommended_political_leaning)[1],
                                                 levels(recommendation_123$recommended_political_leaning)[2])
classification.table <- table(recommendation_123[test_123, 'recommended_political_leaning'], recommendation_123[test_123,'predicted'])
classification.table
round(classification.table/nrow(recommendation_123[test_123,]),3)
print('Accuracy:')
sum(diag(classification.table))/sum(classification.table)

       
        LEFT RIGHT
  LEFT   913   166
  RIGHT  411  2932

       
         LEFT RIGHT
  LEFT  0.206 0.038
  RIGHT 0.093 0.663

[1] "Accuracy:"


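**The results confirm that recommendations are more polarized at the top of the recommendation line-up: test-set accuracy rises from about 85.6% for all ranks to about 87.0% for the first three recommendations and about 89.1% for the first one.**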
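
Since the accuracy lines print no value, the comparison across ranks can be recomputed from the test-set tables shown above. A minimal sketch, with `test_counts` and `accuracy` as illustrative names:

```r
# Test-set confusion counts copied from the outputs above, in the order
# (actual LEFT, predicted LEFT), (actual LEFT, predicted RIGHT),
# (actual RIGHT, predicted LEFT), (actual RIGHT, predicted RIGHT).
test_counts <- list(
  all_ranks = c(2910, 774, 1345, 9665),
  top_three = c(913, 166, 411, 2932),
  rank_one  = c(319, 49, 119, 1055)
)

# Correct predictions are the first and last entries (the diagonal).
accuracy <- sapply(test_counts, function(x) (x[1] + x[4]) / sum(x))
round(accuracy, 3)
```

The gaps are a few percentage points and the notebook does not test whether they are statistically significant, so they are best read as indicative.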

Another question must be posed: can the accuracy be explained by same-channel recommendations? YouTube may favor recommendations from the seed video's own channel, which by definition belong to the same ideological camp. Let's see whether the same bias remains for recommendations from different channels.

In [55]:
recommendation_different_channels <- recommendation[recommendation$seed_channel_id != recommendation$recommended_channel_id, ]
nrow(recommendation_different_channels)

In [63]:
set.seed(2)
train_different_channels <- rep(TRUE, nrow(recommendation_different_channels))
train_different_channels[sample.int(length(train_different_channels), 0.3*nrow(recommendation_different_channels))] <- FALSE
test_different_channels <- (!train_different_channels)

In [67]:
glm.model <- glm(recommended_political_leaning~seed_political_leaning, data=recommendation_different_channels[train_different_channels,], family=binomial)
summary(glm.model)


Call:
glm(formula = recommended_political_leaning ~ seed_political_leaning, 
    family = binomial, data = recommendation_different_channels[train_different_channels, 
        ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1600   0.4518   0.4518   0.4518   1.2094  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -0.07505    0.02437   -3.08  0.00207 ** 
seed_political_leaningRIGHT  2.30586    0.03497   65.94  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25668  on 24885  degrees of freedom
Residual deviance: 20896  on 24884  degrees of freedom
AIC: 20900

Number of Fisher Scoring iterations: 4


In [68]:
glm.probabilities <- predict(glm.model, recommendation_different_channels[train_different_channels,], type="response")
recommendation_different_channels[train_different_channels, 'predicted'] <- ifelse(glm.probabilities < .5,
                                                 levels(recommendation_different_channels$recommended_political_leaning)[1],
                                                 levels(recommendation_different_channels$recommended_political_leaning)[2])
classification.table <- table(recommendation_different_channels[train_different_channels, 'recommended_political_leaning'], recommendation_different_channels[train_different_channels,'predicted'])
classification.table
round(classification.table/nrow(recommendation_different_channels[train_different_channels,]),3)
print('Accuracy:')
sum(diag(classification.table))/sum(classification.table)

       
         LEFT RIGHT
  LEFT   3499  1760
  RIGHT  3246 16381

       
         LEFT RIGHT
  LEFT  0.141 0.071
  RIGHT 0.130 0.658

[1] "Accuracy:"


In [69]:
glm.probabilities <- predict(glm.model, recommendation_different_channels[test_different_channels,], type="response")
recommendation_different_channels[test_different_channels, 'predicted'] <- ifelse(glm.probabilities < .5,
                                                 levels(recommendation_different_channels$recommended_political_leaning)[1],
                                                 levels(recommendation_different_channels$recommended_political_leaning)[2])
classification.table <- table(recommendation_different_channels[test_different_channels, 'recommended_political_leaning'], recommendation_different_channels[test_different_channels,'predicted'])
classification.table
round(classification.table/nrow(recommendation_different_channels[test_different_channels,]),3)
print('Accuracy:')
sum(diag(classification.table))/sum(classification.table)

       
        LEFT RIGHT
  LEFT  1422   763
  RIGHT 1341  7139

       
         LEFT RIGHT
  LEFT  0.133 0.072
  RIGHT 0.126 0.669

[1] "Accuracy:"
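
The accuracy value is again missing from the output, so the sketch below recomputes it from the test-set counts for different-channel recommendations and compares it with the lazy hypothesis within this subset. The names `confusion`, `accuracy`, and `baseline` are illustrative.

```r
# Test-set confusion matrix for different-channel recommendations,
# copied from the output above (rows: actual, columns: predicted).
confusion <- matrix(c(1422,  763,
                      1341, 7139),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("LEFT", "RIGHT"),
                                    predicted = c("LEFT", "RIGHT")))

# Share of correct predictions.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Always predicting the most frequent actual leaning in this subset.
baseline <- max(rowSums(confusion)) / sum(confusion)

round(c(accuracy = accuracy, baseline = baseline), 3)
```

On these counts the model is right about 80% of the time, against roughly 79.5% for always predicting RIGHT. The seed coefficient in the summary above is still large and significant, so the association persists across channels, but the accuracy gain over the baseline is much smaller than for the full dataset.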
