# Assessing Wikipedia Bias

## 1. You will need to collect data from a source of your choosing (dataset, Wikipedia API, web scraping)

## Introduction 

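The project trains a sentence-level bias classifier on labeled news sentences, with the aim of scoring Wikipedia articles for bias.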

## Data Overview

In [1]:
# Import necessary libraries
import pandas as pd
import re
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats as st

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")


In [2]:
# Load the dataset (semicolon-separated CSV)
data = pd.read_csv('final_labels.csv', sep=';')
# Display the first few rows of the dataset
display(data.head())

Unnamed: 0,text,news_link,outlet,topic,type,group_id,num_sent,label_bias,label_opinion,article,biased_words
0,YouTube is making clear there will be no “birt...,https://eu.usatoday.com/story/tech/2020/02/03/...,usa-today,elections-2020,center,1,1,Biased,Somewhat factual but also opinionated,YouTube says no ‘deepfakes’ or ‘birther’ video...,"['belated', 'birtherism']"
1,So while there may be a humanitarian crisis dr...,https://www.alternet.org/2019/01/here-are-5-of...,alternet,immigration,left,1,1,Biased,Expresses writer’s opinion,Speaking to the country for the first time fro...,['crisis']
2,"Looking around the United States, there is nev...",https://thefederalist.com/2020/03/11/woman-who...,federalist,abortion,right,1,1,Biased,Somewhat factual but also opinionated,The left has a thing for taking babies hostage...,"['killing', 'never', 'developing', 'humans', '..."
3,The Republican president assumed he was helpin...,http://www.msnbc.com/rachel-maddow-show/auto-i...,msnbc,environment,left,1,1,Biased,Expresses writer’s opinion,"In Barack Obama’s first term, the administrati...","['rejects', 'happy', 'assumed']"
4,The explosion of the Hispanic population has l...,https://www.breitbart.com/politics/2015/02/26/...,breitbart,student-debt,right,1,1,Biased,No agreement,"Republicans should stop fighting amnesty, Pres...",['explosion']
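Because the file is read from CSV, the `biased_words` column arrives as text such as `"['belated', 'birtherism']"` rather than as Python lists. A minimal sketch of one way to parse it, assuming every cell holds a list literal; the helper name `parse_biased_words` is hypothetical and not part of the notebook:

```python
import ast

def parse_biased_words(cell):
    """Turn a cell such as "['belated', 'birtherism']" into a Python list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Anything that is not a valid Python literal is treated as "no words"
        return []
    return list(value) if isinstance(value, (list, tuple)) else []

parsed = parse_biased_words("['belated', 'birtherism']")
empty = parse_biased_words("[]")
print(parsed, empty)
```

In the notebook this could be applied with `data['biased_words'].apply(parse_biased_words)`, which would make it possible to count biased words per sentence.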


### Data preprocessing

In [4]:
# Display the column names of the dataset
column_names = data.columns.tolist()
display(column_names)

['text',
 'news_link',
 'outlet',
 'topic',
 'type',
 'group_id',
 'num_sent',
 'label_bias',
 'label_opinion',
 'article',
 'biased_words']

In [5]:
# Display the shape of the dataset
n_rows, n_cols = data.shape
print(f"The DataFrame has {n_rows} rows and {n_cols} columns")

The DataFrame has 1700 rows and 11 columns


In [6]:
# Display the informative summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700 entries, 0 to 1699
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   text           1700 non-null   object
 1   news_link      1681 non-null   object
 2   outlet         1700 non-null   object
 3   topic          1700 non-null   object
 4   type           1700 non-null   object
 5   group_id       1700 non-null   int64 
 6   num_sent       1700 non-null   int64 
 7   label_bias     1700 non-null   object
 8   label_opinion  1700 non-null   object
 9   article        1595 non-null   object
 10  biased_words   1700 non-null   object
dtypes: int64(2), object(9)
memory usage: 146.2+ KB


In [7]:
# Display the descriptive statistics of the dataset
data.describe()

Unnamed: 0,group_id,num_sent
count,1700.0,1700.0
mean,43.0,1.124706
std,24.542908,0.414256
min,1.0,1.0
25%,22.0,1.0
50%,43.0,1.0
75%,64.0,1.0
max,85.0,5.0


## 2. You will conduct whatever EDA you see fit to investigate the text of the Wikipedia articles you want to predict on for biased terms, sentiment, or other linguistic significance.

## Exploratory Data Analysis

### Duplicates

In [8]:
# Display the number of duplicates in the dataset
duplicates = data[data.duplicated()]
display(f"Number of duplicated data: {duplicates.shape[0]}")

'Number of duplicated data: 0'

### Missing Values

In [9]:
# Display the number of missing values in the dataset
display(data.isna().sum())

# Show missing values as a fraction of all rows
display(data.isna().sum()/len(data)) 

text               0
news_link         19
outlet             0
topic              0
type               0
group_id           0
num_sent           0
label_bias         0
label_opinion      0
article          105
biased_words       0
dtype: int64

text             0.000000
news_link        0.011176
outlet           0.000000
topic            0.000000
type             0.000000
group_id         0.000000
num_sent         0.000000
label_bias       0.000000
label_opinion    0.000000
article          0.061765
biased_words     0.000000
dtype: float64

In [10]:
# Drop rows with missing values in the 'news_link' and 'article' columns
data.dropna(subset=['news_link'], inplace=True)
data.dropna(subset=['article'], inplace=True)

In [11]:
# Confirm that no missing values remain (shown as a fraction of all rows)
display(data.isna().sum()/len(data)) 

text             0.0
news_link        0.0
outlet           0.0
topic            0.0
type             0.0
group_id         0.0
num_sent         0.0
label_bias       0.0
label_opinion    0.0
article          0.0
biased_words     0.0
dtype: float64

In [12]:
# Define a function that normalizes a sentence from the 'text' column
def clear_text(text):
    text = text.lower()                    # lowercase everything
    text = re.sub(r"http\S+", "", text)    # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)   # keep only letters and whitespace
    return " ".join(text.split())          # collapse repeated whitespace
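A quick way to see what `clear_text` keeps and discards is to run it on one made-up sentence. The sketch below repeats the function so that it runs on its own; the sample string and URL are invented for illustration. Because punctuation is deleted rather than replaced with a space, hyphenated words and contractions are fused (`less-organized` becomes `lessorganized`, `don't` becomes `dont`).

```python
import re

def clear_text(text):
    """Same steps as the notebook cell: lowercase, drop URLs, keep a-z and whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(text.split())

# Invented sample sentence, not taken from the dataset
sample = "Less-organized migrants don't wait: see https://example.com/story (2020)."
cleaned = clear_text(sample)
print(cleaned)  # lessorganized migrants dont wait see
```

If fused words turn out to hurt the model, replacing punctuation with a space instead of an empty string would be one alternative to try.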

In [13]:
# Apply the clear_text function to the 'text' column
data['clean_text'] = data['text'].astype(str).apply(clear_text)
data = data.drop(columns=['text'])

# Display 5 random rows of the DataFrame after cleaning
display(data.sample(5))


Unnamed: 0,news_link,outlet,topic,type,group_id,num_sent,label_bias,label_opinion,article,biased_words,clean_text
837,https://www.breitbart.com/europe/2020/02/07/fe...,breitbart,gender,right,72,1,Biased,Somewhat factual but also opinionated,"Over 70,000 women in Germany have been subject...","['gruesome', 'ritual']",female genital mutilation is illegal in german...
1561,https://www.breitbart.com/clips/2019/01/20/bra...,breitbart,gender,right,62,1,Non-biased,No agreement,"Brandon Straka, founder of the #WalkAway campa...",['left-wing'],brandon straka founder of the walkaway campaig...
1024,https://www.alternet.org/2020/06/awful-news-fo...,alternet,elections-2020,left,7,1,No agreement,Expresses writer’s opinion,According to an analysis by Politico’s Jeff Gr...,[],equally important he notes is that it is diffi...
254,https://www.breitbart.com/politics/2020/02/01/...,breitbart,abortion,right,23,1,Biased,Somewhat factual but also opinionated,"Kristen Day, executive director of Democrats f...","['extreme', 'pro-abortion']",day said that some democrats had phoned her to...
20,https://www.nbcnews.com/news/latino/honduran-m...,msnbc,immigration,left,2,1,Biased,No agreement,Get breaking news alerts and special reports. ...,['caravan'],lessorganized migrants tighter immigration con...


In [14]:
# Check for missing values
print(data['clean_text'].isna().sum())  

0


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1576 entries, 0 to 1699
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   news_link      1576 non-null   object
 1   outlet         1576 non-null   object
 2   topic          1576 non-null   object
 3   type           1576 non-null   object
 4   group_id       1576 non-null   int64 
 5   num_sent       1576 non-null   int64 
 6   label_bias     1576 non-null   object
 7   label_opinion  1576 non-null   object
 8   article        1576 non-null   object
 9   biased_words   1576 non-null   object
 10  clean_text     1576 non-null   object
dtypes: int64(2), object(9)
memory usage: 147.8+ KB
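The drop from 1700 to 1576 rows can be cross-checked against the missing-value counts reported earlier (19 for `news_link`, 105 for `article`). A small worked check using only those printed numbers:

```python
total_rows = 1700          # rows before dropping
missing_news_link = 19     # rows with a missing 'news_link'
missing_article = 105      # rows with a missing 'article'
rows_after_drop = 1576     # rows reported by data.info() after dropping

rows_dropped = total_rows - rows_after_drop
# Inclusion-exclusion: dropped = missing_news_link + missing_article - missing_both
rows_missing_both = missing_news_link + missing_article - rows_dropped

print(rows_dropped)        # 124
print(rows_missing_both)   # 0
```

Since the two counts add up exactly to the 124 rows removed, no row was missing both fields. About 7.3% of the original rows were dropped.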


In [16]:
# Set of English stop words
stop_words = set(stopwords.words('english'))

In [17]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer() 

def lemmatize(text):
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token not in stop_words]
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(lemmas)
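One interaction is worth checking here: `clear_text` has already removed apostrophes, so a token like `dont` no longer matches a stop word spelled `don't` and passes through the filter. The sketch below illustrates the effect with a tiny hand-written stop-word set (an assumption for illustration; the notebook uses NLTK's English list) and shows one possible remedy, adding apostrophe-free variants to the set. The helper name `strip_non_letters` is hypothetical.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
stop_words = {"the", "and", "don't", "isn't"}

def strip_non_letters(word):
    """Apply the same letter-only rule that clear_text applies to the sentences."""
    return re.sub(r"[^a-z]", "", word.lower())

# Add the apostrophe-free spelling of every stop word
normalized_stop_words = stop_words | {strip_non_letters(w) for w in stop_words}

tokens = "track and field athletes dont typically earn".split()
kept_raw = [t for t in tokens if t not in stop_words]
kept_normalized = [t for t in tokens if t not in normalized_stop_words]

print(kept_raw)         # ['track', 'field', 'athletes', 'dont', 'typically', 'earn']
print(kept_normalized)  # ['track', 'field', 'athletes', 'typically', 'earn']
```

Whether negations should be removed at all is a separate modeling choice, since words like `dont` or `never` may carry signal for bias detection.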

In [20]:
# Apply the lemmatize function to the 'clean_text' column
data['lemmatize_text'] = data['clean_text'].apply(lemmatize) 

In [21]:
# Display the first 20 rows of the cleaned and lemmatized text side by side
display(data[['clean_text', 'lemmatize_text']].head(20))

Unnamed: 0,clean_text,lemmatize_text
0,youtube is making clear there will be no birth...,youtube making clear birtherism platform year ...
1,so while there may be a humanitarian crisis dr...,may humanitarian crisis driving vulnerable peo...
2,looking around the united states there is neve...,looking around united state never enough welfa...
3,the republican president assumed he was helpin...,republican president assumed helping industry ...
4,the explosion of the hispanic population has l...,explosion hispanic population longterm job pro...
5,the antivaccine movement made headlines last s...,antivaccine movement made headline last spring...
6,voting in quasimilitarized settings was not co...,voting quasimilitarized setting confined natio...
7,but one glaring absentee was trump who not onl...,one glaring absentee trump declined invitation...
9,track and field athletes dont typically earn t...,track field athlete dont typically earn lucrat...
10,in other words the agency responsible for prot...,word agency responsible protecting consumer wa...


In [22]:
data.shape

(1576, 12)

## 3. You will conduct supervised learning to predict whether a given text is biased. You might want to do this at the sentence-by-sentence level.
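Before a classifier can be fitted, the string labels in `label_bias` need a numeric target. A minimal sketch, assuming a binary setup in which `Biased` maps to 1, `Non-biased` maps to 0, and `No agreement` rows are left out. That treatment of `No agreement`, the helper name `encode_labels`, and the sample rows are assumptions for illustration, not part of the notebook.

```python
# Assumed mapping; 'No agreement' is deliberately absent and therefore skipped
LABEL_TO_TARGET = {"Biased": 1, "Non-biased": 0}

def encode_labels(rows):
    """Split (sentence, label_bias) pairs into texts and 0/1 targets."""
    texts, targets = [], []
    for sentence, label in rows:
        if label in LABEL_TO_TARGET:
            texts.append(sentence)
            targets.append(LABEL_TO_TARGET[label])
    return texts, targets

# Invented example rows in the shape (lemmatize_text, label_bias)
rows = [
    ("explosion hispanic population", "Biased"),
    ("lawmaker met monday", "Non-biased"),
    ("equally important note", "No agreement"),
]
texts, targets = encode_labels(rows)
print(texts, targets)
```

The resulting `texts` and `targets` could then be passed to `train_test_split`, which the notebook already imports.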

## 4. You need a prediction function that can take in a new Wikipedia article and predict how biased it is. You can do this by predicting whether each sentence in an article is biased, then perhaps scaling the results by the length of the article to get something like a “bias score”.
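The scoring idea in this step can be sketched as the share of sentences flagged as biased. The code below is only an outline under stated assumptions: the sentence splitter is a naive regular expression, `toy_predictor` is a keyword placeholder standing in for a trained model, and the function names `split_sentences` and `bias_score` are hypothetical.

```python
import re
from typing import Callable, List

def split_sentences(article: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", article.strip())
    return [p for p in parts if p]

def bias_score(article: str, is_biased: Callable[[str], bool]) -> float:
    """Fraction of sentences flagged as biased; 0.0 for an empty article."""
    sentences = split_sentences(article)
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if is_biased(s))
    return flagged / len(sentences)

def toy_predictor(sentence: str) -> bool:
    """Placeholder for a trained classifier: flags one hard-coded keyword."""
    return "gruesome" in sentence.lower()

# Invented three-sentence article for illustration
article = "The ritual is illegal. Critics call it a gruesome practice! Lawmakers met on Monday."
score = bias_score(article, toy_predictor)
print(round(score, 3))  # 0.333
```

Dividing by the number of sentences keeps the score between 0 and 1, so long and short articles stay comparable. A trained model would replace `toy_predictor` and should receive sentences that went through the same cleaning and lemmatization as the training data.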