In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter

# Set visualization style
sns.set(style='whitegrid')

# Load the datasets
true_news = pd.read_csv('data/News_dataset/True.csv')
fake_news = pd.read_csv('data/News_dataset/Fake.csv')

print('True news shape:', true_news.shape, 'Fake news shape:',fake_news.shape)
# Add a label column to distinguish true and fake news
true_news['label'] = 'True'
fake_news['label'] = 'Fake'

# Merge the two datasets into one
news_df = pd.concat([true_news, fake_news], axis=0).reset_index(drop=True)


# Basic dataset information
print("Dataset shape:")
news_df.shape

True news shape: (21417, 4) Fake news shape: (23481, 4)
Dataset shape:


(44898, 5)

In [2]:
true_news.head()

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",True
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",True
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",True
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",True
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",True


In [4]:
fake_news.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Fake


In [5]:
print("Column names:") 
news_df.columns

Column names:


Index(['title', 'text', 'subject', 'date', 'label'], dtype='object')

In [6]:
print("\nSample data:")
news_df.head()


Sample data:


Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",True
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",True
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",True
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",True
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",True


In [7]:
# Check for missing values
print("\nMissing values per column:")
news_df.isnull().sum()


Missing values per column:


title      0
text       0
subject    0
date       0
label      0
dtype: int64

## Task 1: Study how date behaves.
- Are there more fake news in a certain time period?
- Can you identify any patterns or reasons for spikes in fake news during certain periods?

In [8]:
true_news['date'][0][-5:]

'2017 '

In [9]:
fake_news['date'][0][-4:]

'2017'

The `date` values in `true_news` end with an extra trailing space, while those in `fake_news` do not. Let's remove it before merging the two datasets again.

In [10]:
# Strip surrounding whitespace from every date in true_news
true_news['date'] = true_news['date'].str.strip()

# Rebuild the merged dataframe so it picks up the cleaned dates
news_df = pd.concat([true_news, fake_news], axis=0).reset_index(drop=True)

In [11]:
years = []

for e in news_df['date']:
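    try:
        # Most dates end with a four-digit year, e.g. "December 31, 2017"
        years.append(int(e[-4:]))
    except ValueError:
        # The last four characters are not a number: accept dates ending in '18'
        if e[-2:] == '18':
            years.append(2018)
        else:
            # Anything else (e.g. a URL stored in the date column) is marked as missing
            years.append(None)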

news_df['year'] = years

news_df

Unnamed: 0,title,text,subject,date,label,year
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",True,2017.0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",True,2017.0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",True,2017.0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",True,2017.0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",True,2017.0
...,...,...,...,...,...,...
44893,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",Fake,2016.0
44894,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",Fake,2016.0
44895,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",Fake,2016.0
44896,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",Fake,2016.0


In [12]:
news_df[(news_df['year'] == 2015) & (news_df['label'] == 'Fake')]

Unnamed: 0,title,text,subject,date,label,year
36146,EVERY U.S. CITIZEN TAKEN HOSTAGE IN IRAN To Be...,Just another slap in the face to US citizens. ...,politics,"Dec 31, 2015",Fake,2015.0
36147,WATCH FUNNIEST MAN In American Politics Ridicu...,You don t want to miss this!Here s a little in...,politics,"Dec 31, 2015",Fake,2015.0
36148,"FBI POSTS $5,000 REWARD For Person Who Committ...","Americans were warned by Attorney General, Lor...",politics,"Dec 31, 2015",Fake,2015.0
36149,SWISS ARMY CHIEF WARNS CITIZENS About Explosiv...,Wouldn t it be great if we had someone in gove...,politics,"Dec 31, 2015",Fake,2015.0
36150,WOW! Sarah Palin’s Stunning AZ Vacation Home G...,"It would be great if her former running mate,...",politics,"Dec 30, 2015",Fake,2015.0
...,...,...,...,...,...,...
43332,ENTITLED IRS ETHICS LAWYER DISBARRED FOR ETHIC...,Don t you just love an entitled IRS lawyer who...,left-news,"Apr 4, 2015",Fake,2015.0
43333,[VIDEO] 16 YR OLD ARRESTED For Violent Gang Be...,This is a sad commentary on a generation who h...,left-news,"Apr 4, 2015",Fake,2015.0
43334,“Non-violence hasn’t worked”…Reverend Sam Most...,Yeah that whole taking up arms thing seems t...,left-news,"Apr 1, 2015",Fake,2015.0
43335,WATCH DIRTY HARRY REID ON HIS LIE ABOUT ROMNEY...,"In case you missed it Sen. Harry Reid (R-NV), ...",left-news,"Mar 31, 2015",Fake,2015.0


In [13]:
news_df['year'].value_counts()

year
2017.0    25904
2016.0    16470
2015.0     2479
2018.0       35
Name: count, dtype: int64

In [14]:
news_df.isna().sum()

title       0
text        0
subject     0
date        0
label       0
year       10
dtype: int64

In [15]:
news_df[pd.isna(news_df['year'])]

Unnamed: 0,title,text,subject,date,label,year
30775,https://100percentfedup.com/served-roy-moore-v...,https://100percentfedup.com/served-roy-moore-v...,politics,https://100percentfedup.com/served-roy-moore-v...,Fake,
36924,https://100percentfedup.com/video-hillary-aske...,https://100percentfedup.com/video-hillary-aske...,politics,https://100percentfedup.com/video-hillary-aske...,Fake,
36925,https://100percentfedup.com/12-yr-old-black-co...,https://100percentfedup.com/12-yr-old-black-co...,politics,https://100percentfedup.com/12-yr-old-black-co...,Fake,
37256,https://fedup.wpengine.com/wp-content/uploads/...,https://fedup.wpengine.com/wp-content/uploads/...,politics,https://fedup.wpengine.com/wp-content/uploads/...,Fake,
37257,https://fedup.wpengine.com/wp-content/uploads/...,https://fedup.wpengine.com/wp-content/uploads/...,politics,https://fedup.wpengine.com/wp-content/uploads/...,Fake,
38849,https://fedup.wpengine.com/wp-content/uploads/...,https://fedup.wpengine.com/wp-content/uploads/...,Government News,https://fedup.wpengine.com/wp-content/uploads/...,Fake,
38850,https://fedup.wpengine.com/wp-content/uploads/...,https://fedup.wpengine.com/wp-content/uploads/...,Government News,https://fedup.wpengine.com/wp-content/uploads/...,Fake,
40350,Homepage,[vc_row][vc_column width= 1/1 ][td_block_trend...,left-news,MSNBC HOST Rudely Assumes Steel Worker Would N...,Fake,
43286,https://fedup.wpengine.com/wp-content/uploads/...,https://fedup.wpengine.com/wp-content/uploads/...,left-news,https://fedup.wpengine.com/wp-content/uploads/...,Fake,
43287,https://fedup.wpengine.com/wp-content/uploads/...,https://fedup.wpengine.com/wp-content/uploads/...,left-news,https://fedup.wpengine.com/wp-content/uploads/...,Fake,


In [16]:
clean_df = news_df.dropna(subset=['year'])
clean_df.to_csv('data/News_dataset/clean.csv', index = False)
clean_df.isna().sum()

title      0
text       0
subject    0
date       0
label      0
year       0
dtype: int64
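
Task 1 also asks whether fake news clusters in particular periods, which the year alone cannot show. The following is a minimal sketch of a month-level count, not part of the original notebook. `parse_mixed_dates` is a hypothetical helper, the tiny `sample` frame stands in for `news_df`, and the list of formats is an assumption: only the first two appear in the outputs above, and `%d-%b-%y` is a guess at what the two-digit-year rows look like.

```python
import pandas as pd

def parse_mixed_dates(dates, formats=("%B %d, %Y", "%b %d, %Y", "%d-%b-%y")):
    """Try each format in turn; values matching none of them become NaT."""
    cleaned = dates.astype(str).str.strip()
    parsed = None
    for fmt in formats:
        attempt = pd.to_datetime(cleaned, format=fmt, errors="coerce")
        # Keep dates already parsed, fill the remaining gaps with this attempt
        parsed = attempt if parsed is None else parsed.fillna(attempt)
    return parsed

# Toy stand-in for news_df (illustrative rows, not real data)
sample = pd.DataFrame({
    "date": ["December 31, 2017 ", "Dec 31, 2015", "19-Feb-18", "https://example.com/x"],
    "label": ["True", "Fake", "Fake", "Fake"],
})
sample["parsed_date"] = parse_mixed_dates(sample["date"])

# Drop unparseable rows, then count articles per month and label
valid = sample.dropna(subset=["parsed_date"]).copy()
valid["month"] = valid["parsed_date"].dt.to_period("M")
monthly_counts = valid.groupby(["month", "label"]).size().unstack(fill_value=0)
print(monthly_counts)
```

Plotting `monthly_counts` as a line chart would make any spikes visible. Rows whose date cannot be parsed end up as `NaT`, which plays the same role as the missing `year` values above.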

## Task 2: Check dataset balance.
- Is the dataset balanced in terms of true and fake news?
- What percentage of the dataset is true versus fake, and should we consider resampling or balancing the data?

In [17]:
news_df = pd.read_csv('data/News_dataset/clean.csv')

news_df

Unnamed: 0,title,text,subject,date,label,year
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",True,2017.0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",True,2017.0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",True,2017.0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",True,2017.0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",True,2017.0
...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",Fake,2016.0
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",Fake,2016.0
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",Fake,2016.0
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",Fake,2016.0


In [18]:
print(news_df['label'].unique())
def check_balance(news_df):   
    total = len(news_df)
    count = news_df['label'].value_counts()
    
    # Safely get the counts for True and Fake, defaulting to 0 if the label doesn't exist
    true_count = count.get('True', 0)
    fake_count = count.get('Fake', 0)
    
    # Calculate the percentages
    true_perc = (true_count / total * 100) if total > 0 else 0
    fake_perc = (fake_count / total * 100) if total > 0 else 0
    
    # Print the results
    print(f"True news: {true_count} ({true_perc:.2f}%)")
    print(f"Fake news: {fake_count} ({fake_perc:.2f}%)")
    
    # Check for imbalance
    if true_perc > 70 or fake_perc > 70:
        print("The dataset is imbalanced. Consider resampling.")
    else:
        print("The dataset is fairly balanced.")


check_balance(news_df)

['True' 'Fake']
True news: 21417 (47.71%)
Fake news: 23471 (52.29%)
The dataset is fairly balanced.


In [19]:
for y in news_df['year'].unique():
    print(y)
    check_balance(news_df[news_df['year'] == y])

2017.0
True news: 16701 (64.47%)
Fake news: 9203 (35.53%)
The dataset is fairly balanced.
2016.0
True news: 4716 (28.63%)
Fake news: 11754 (71.37%)
The dataset is imbalanced. Consider resampling.
2018.0
True news: 0 (0.00%)
Fake news: 35 (100.00%)
The dataset is imbalanced. Consider resampling.
2015.0
True news: 0 (0.00%)
Fake news: 2479 (100.00%)
The dataset is imbalanced. Consider resampling.


The dataset is imbalanced when split by year but balanced as a whole. Since the goal is to classify news in general (not news from any particular year), we will not give importance to the year and will drop the date-related columns. We will focus only on features related to the content of each article: title, text, and subject.
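
The per-year percentages can also be read from a single table. This is a small sketch using `pd.crosstab` on a toy frame (the `toy` rows are made up for illustration); with the real data, the same call on `news_df['year']` and `news_df['label']` should give one row per year.

```python
import pandas as pd

# Toy stand-in for news_df with only the two columns needed here
toy = pd.DataFrame({
    "year":  [2017, 2017, 2017, 2016, 2016, 2015],
    "label": ["True", "True", "Fake", "True", "Fake", "Fake"],
})

# Row-normalised contingency table: each row sums to 100
share_by_year = pd.crosstab(toy["year"], toy["label"], normalize="index") * 100
print(share_by_year.round(2))
```

A year that contains only one class shows up as a 0 / 100 row, which is easier to spot than in the printed loop output.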

In [20]:
news_df.head()

Unnamed: 0,title,text,subject,date,label,year
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",True,2017.0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",True,2017.0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",True,2017.0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",True,2017.0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",True,2017.0


In [21]:
news_df = news_df.drop(['date', 'year'], axis = 1)

news_df.to_csv('data/News_dataset/nodate.csv', index = False)

## Task 3: Explore text and title lengths.
- How does the length of articles (number of words) differ between true and fake news?
- Do fake news articles tend to have shorter or longer titles compared to true news?

In [22]:
df = pd.read_csv('data/News_dataset/nodate.csv')

In [23]:
df['text_length'] = df['text'].str.len()

df['title_length'] = df['title'].str.len()


In [24]:
true_text_length_mean = df[df['label'] == 'True']['text_length'].mean()
print(f"The mean length of the text in the true news is: {true_text_length_mean:.0f}")

true_title_length_mean = df[df['label'] == 'True']['title_length'].mean()
print(f"The mean length of the title in the true news is: {true_title_length_mean:.0f}")

fake_text_length_mean = df[df['label'] == 'Fake']['text_length'].mean()
print(f"The mean length of the text in the fake news is: {fake_text_length_mean:.0f}")

fake_title_length_mean = df[df['label'] == 'Fake']['title_length'].mean()
print(f"The mean length of the title in the fake news is: {fake_title_length_mean:.0f}")

The mean length of the text in the true news is: 2383
The mean length of the title in the true news is: 65
The mean length of the text in the fake news is: 2548
The mean length of the title in the fake news is: 94
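
Note that `str.len()` counts characters, while the task asks about the number of words. A minimal sketch of a word-based version follows; the `toy` frame and the column names `text_words` and `title_words` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Senate passes budget bill", "YOU WON'T BELIEVE what happened next"],
    "text":  ["The Senate voted on Tuesday.", "Wow. Just wow. Read this now!"],
    "label": ["True", "Fake"],
})

# str.split() with no argument splits on any run of whitespace
toy["text_words"] = toy["text"].str.split().str.len()
toy["title_words"] = toy["title"].str.split().str.len()

# Mean and median per label; the median is less sensitive to very long articles
word_stats = toy.groupby("label")[["text_words", "title_words"]].agg(["mean", "median"])
print(word_stats)
```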


## Task 4: Analyze the subject distribution.
- How do the subjects/topics of the news articles differ between true and fake news?
- Are certain subjects more prone to fake news?
- What would you do with this column, should we keep it or drop it?

In [25]:
# Get unique subjects for true and fake news
t_topics = true_news['subject'].unique()
f_topics = fake_news['subject'].unique()

print("Unique subjects in true news:")
print(t_topics)
print()

print("Unique subjects in fake news:")
print(f_topics)
print()

# Count and percentage of subjects in true news
t_topics_count = true_news['subject'].value_counts()
t_perc = (t_topics_count / t_topics_count.sum() * 100).round(2)

print("Subject counts in true news:")
print(t_topics_count)
print()

print("Percentage of each subject in true news:")
print(t_perc)
print()

# Count and percentage of subjects in fake news
f_topics_count = fake_news['subject'].value_counts()
f_perc = (f_topics_count / f_topics_count.sum() * 100).round(2)

print("Subject counts in fake news:")
print(f_topics_count)
print()

print("Percentage of each subject in fake news:")
print(f_perc)
print()


Unique subjects in true news:
['politicsNews' 'worldnews']

Unique subjects in fake news:
['News' 'politics' 'Government News' 'left-news' 'US_News' 'Middle-east']

Subject counts in true news:
subject
politicsNews    11272
worldnews       10145
Name: count, dtype: int64

Percentage of each subject in true news:
subject
politicsNews    52.63
worldnews       47.37
Name: count, dtype: float64

Subject counts in fake news:
subject
News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: count, dtype: int64

Percentage of each subject in fake news:
subject
News               38.54
politics           29.13
left-news          18.99
Government News     6.69
US_News             3.33
Middle-east         3.31
Name: count, dtype: float64
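
The two lists of unique subjects printed above share no value. If that holds for the whole dataset, `subject` alone would identify the label, which is one argument for dropping the column before modelling. The sketch below shows how this could be checked; the `toy` frame is illustrative and only mimics the shape of the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "subject": ["politicsNews", "worldnews", "News", "politics", "News"],
    "label":   ["True", "True", "Fake", "Fake", "Fake"],
})

# Counts of each subject under each label
subject_by_label = pd.crosstab(toy["subject"], toy["label"])
print(subject_by_label)

# Subjects that occur under both labels; an empty set means no overlap
true_subjects = set(toy.loc[toy["label"] == "True", "subject"])
fake_subjects = set(toy.loc[toy["label"] == "Fake", "subject"])
shared_subjects = true_subjects & fake_subjects
print("Shared subjects:", shared_subjects)
```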



## Task 6: Investigate article length outliers.
- Are there any significant outliers in text or title lengths?
- Do these outliers correspond to true or fake news more frequently?

# Function to calculate outliers based on IQR
def find_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]

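# Find outliers for title and text lengths.
# The length columns were added to df (Task 3), not to true_news / fake_news,
# so the rows of each class are selected from df.
true_df = df[df['label'] == 'True']
fake_df = df[df['label'] == 'Fake']

true_news_title_outliers = find_outliers(true_df, 'title_length')
true_news_text_outliers = find_outliers(true_df, 'text_length')

fake_news_title_outliers = find_outliers(fake_df, 'title_length')
fake_news_text_outliers = find_outliers(fake_df, 'text_length')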

# Count outliers in each dataset
true_news_title_outliers_count = len(true_news_title_outliers)
true_news_text_outliers_count = len(true_news_text_outliers)

fake_news_title_outliers_count = len(fake_news_title_outliers)
fake_news_text_outliers_count = len(fake_news_text_outliers)

print("Number of title length outliers in True News:", true_news_title_outliers_count)
print("Number of text length outliers in True News:", true_news_text_outliers_count)

print("Number of title length outliers in Fake News:", fake_news_title_outliers_count)
print("Number of text length outliers in Fake News:", fake_news_text_outliers_count)

## Task 7: Check correlation between numerical features.
- Is there any correlation between text length, title length, and the label (true/fake)?
- Can we identify any potential relationships between these features?
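
One possible way to approach this task, sketched on made-up numbers: encode the label as 0/1 and compute a Pearson correlation matrix. With a binary variable, Pearson's coefficient is the point-biserial correlation. The `is_fake` column name and the `toy` values are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "label": ["True", "True", "True", "Fake", "Fake", "Fake"],
    "text_length":  [2400, 2300, 2500, 2600, 2450, 2700],
    "title_length": [64, 60, 70, 95, 90, 99],
})

# Encode the label numerically so it can enter the correlation matrix
toy["is_fake"] = (toy["label"] == "Fake").astype(int)

# DataFrame.corr() uses Pearson correlation by default
corr = toy[["text_length", "title_length", "is_fake"]].corr()
print(corr.round(2))
```

The matrix can be passed to `sns.heatmap(corr, annot=True)` for a visual summary, since `seaborn` is already imported at the top of the notebook.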

## Task 8: Visualize word clouds for common words.
- What are the most common words in true news titles versus fake news titles?
- Can any keywords or patterns be identified that distinguish between true and fake news?
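
As a starting point for this task, word frequencies can be computed with `Counter` before drawing anything. This is a minimal sketch: `top_title_words` is a hypothetical helper, and the titles and the two-word stop list are invented for the example.

```python
import re
from collections import Counter

def top_title_words(titles, stopwords=frozenset(), n=5):
    """Return the n most frequent lowercase words across titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Illustrative titles and a tiny hand-written stop list (both are assumptions)
toy_titles = [
    "Senate votes on budget",
    "Budget talks stall in Senate",
    "Senate leader comments on talks",
]
toy_stopwords = frozenset({"on", "in"})

top_words = top_title_words(toy_titles, toy_stopwords, n=2)
print(top_words)
```

Running it once on the true titles and once on the fake titles gives two lists to compare. A dictionary built from the counts can then be handed to `WordCloud().generate_from_frequencies(...)` to draw the clouds.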
  

### Mean phrase length (in words) of text and title

In [26]:
def calculate_phrase_mean_length(texts):
    phrase_mean_length = []

    for t in texts:
        # Split text into phrases (by period)
        phrases = [phrase.strip() for phrase in t.split('.')]  

        # Get the length of each phrase in terms of word count
        phrase_lengths = [len(phrase.split()) for phrase in phrases if phrase]  

        # Calculate the mean length of the phrases
        if phrase_lengths:  
            mean_length = sum(phrase_lengths) / len(phrase_lengths)
        else:
            mean_length = 0

        phrase_mean_length.append(mean_length)
    
    return phrase_mean_length


df['length_phrases_no_punctuation_text'] = calculate_phrase_mean_length(df['text'])

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759
...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727


In [27]:
# Mean phrase length (words per phrase) of the text in fake and true news

f = df[df['label']== 'Fake']['length_phrases_no_punctuation_text'].mean()
print(f)

t = df[df['label']== 'True']['length_phrases_no_punctuation_text'].mean()
print(t)

19.962980965349598
20.0282793741873


In [28]:
df['length_phrases_no_punctuation_title'] = calculate_phrase_mean_length(df['title'])

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000
...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000


In [29]:
ft = df[df['label']== 'Fake']['length_phrases_no_punctuation_title'].mean()
print(ft)

tt = df[df['label']== 'True']['length_phrases_no_punctuation_title'].mean()
print(tt)

14.170286736824169
8.45882254195043


In [30]:
ft = df[df['label']== 'Fake']['title_length'].mean()
print(ft)

tt = df[df['label']== 'True']['title_length'].mean()
print(tt)

94.20365557496486
64.667880655554


### Quantity of numbers in the text

In [31]:
import re

def count_numbers_in_texts(texts):
    number_counts = []

    for text in texts:
        # Use regular expression to find all sequences of digits
        numbers = re.findall(r'\d+', text)
        
        # Count the number of sequences found
        number_count = len(numbers)
        
        number_counts.append(number_count)
    
    return number_counts



df['qtt_numbers'] = count_numbers_in_texts(df['text'])

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34
...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18


In [32]:
fn = df[df['label']== 'Fake']['qtt_numbers'].mean()
print(fn)

tn = df[df['label']== 'True']['qtt_numbers'].mean()
print(tn)

7.388053342422564
6.010692440584583
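
Fake articles are also longer on average (see Task 3), so part of this gap may simply reflect length. The sketch below shows a vectorised count with `Series.str.count` and a per-100-words rate; the `toy` texts and the column names `word_count` and `numbers_per_100_words` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"text": [
    "In 2017 the bill cost $700 million.",
    "No digits here.",
    "7 of 10 agreed, 30% did not.",
]})

# Series.str.count takes a regex and counts non-overlapping matches per row
toy["qtt_numbers"] = toy["text"].str.count(r"\d+")

# Normalise by word count so long and short articles are comparable
toy["word_count"] = toy["text"].str.split().str.len()
toy["numbers_per_100_words"] = 100 * toy["qtt_numbers"] / toy["word_count"]
print(toy[["qtt_numbers", "word_count", "numbers_per_100_words"]])
```

The same normalisation could be applied to the proper-noun, punctuation, and stop-word counts that follow.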


### Quantity of proper nouns

In [33]:
def count_proper_nouns(texts):
    proper_noun_counts = []

    for text in texts:
        # Split text into sentences using punctuation as delimiters
        sentences = re.split(r'[.!?]\s+', text)

        proper_noun_count = 0

        # For each sentence, find capitalized words that are NOT at the start
        for sentence in sentences:
            # Split sentence into words
            words = sentence.split()
            
            # Exclude the first word in the sentence and count proper nouns
            proper_nouns = [word for word in words[1:] if re.match(r'\b[A-Z][a-z]*\b', word)]
            
            proper_noun_count += len(proper_nouns)
        
        proper_noun_counts.append(proper_noun_count)
    
    return proper_noun_counts

df['qtt_noms_propis'] = count_proper_nouns(df['text'])

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2,75
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3,75
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34,92
...,...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7,62
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5,24
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37,476
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18,54


In [73]:
fp = df[df['label']== 'Fake']['qtt_noms_propis'].mean()
print(fp)

tp = df[df['label']== 'True']['qtt_noms_propis'].mean()
print(tp)

49.572664138724384
51.42442919176355


### Quantity of punctuation marks

In [36]:
def count_punctuation_signals(texts):
    punctuation_counts = []

    # Define a regular expression for common punctuation marks
    punctuation_pattern = r'[.,!?;:()\[\]\'\"-]'

    for text in texts:
        # Find all punctuation marks in the text
        punctuation_signals = re.findall(punctuation_pattern, text)
        
        # Count the number of punctuation marks
        punctuation_count = len(punctuation_signals)
        
        punctuation_counts.append(punctuation_count)
    
    return punctuation_counts

df['qtt_punt'] = count_punctuation_signals(df['text'])

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis,qtt_punt
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88,113
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71,77
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2,75,47
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3,75,51
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34,92,128
...,...,...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7,62,47
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5,24,42
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37,476,527
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18,54,61


In [37]:
fsp = df[df['label']== 'Fake']['qtt_punt'].mean()
print(fsp)

tsp = df[df['label']== 'True']['qtt_punt'].mean()
print(tsp)

54.744407992842234
49.721949852920574


### Quantity of stop words

In [39]:
import nltk

In [40]:
from nltk.corpus import stopwords
 
nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/paucolomercoll/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:
nltk_stopwords = set(stopwords.words('english'))

def count_stopwords(texts):
    stopword_counts = []

    for text in texts:
        # Tokenize the text into words, lowercased
        words = re.findall(r'\b\w+\b', text.lower())
        
        # Count the number of stopwords
        stopword_count = sum(1 for word in words if word in nltk_stopwords)
        
        stopword_counts.append(stopword_count)
    
    return stopword_counts

df['stopwords'] = count_stopwords(df['text'])

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis,qtt_punt,stopwords
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88,113,307
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71,77,253
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2,75,47,202
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3,75,51,156
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34,92,128,353
...,...,...,...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7,62,47,216
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5,24,42,136
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37,476,527,1873
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18,54,61,190


In [42]:
fsw = df[df['label']== 'Fake']['stopwords'].mean()
print(fsw)

tsw = df[df['label']== 'True']['stopwords'].mean()
print(tsw)

192.71113288739295
158.5377970770883
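
Raw stopword counts grow with article length, so part of the gap between the two means may simply reflect longer texts. A minimal sketch of a length-normalized variant follows. The helper `stopword_ratio` and the small set `STOPWORDS_SAMPLE` are assumptions introduced for illustration (the notebook itself uses the full NLTK list), not part of the original analysis.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list, used only
# so that this sketch runs without downloading any corpus.
STOPWORDS_SAMPLE = {"the", "a", "of", "to", "and", "is", "in"}

def stopword_ratio(text, stop_words=STOPWORDS_SAMPLE):
    """Return the share of tokens in `text` that are stopwords (0.0 if empty)."""
    words = re.findall(r'\b\w+\b', text.lower())
    if not words:
        return 0.0
    return sum(1 for word in words if word in stop_words) / len(words)

short_text = "The head of the committee is in Washington"
long_text = short_text + ". " + short_text

# Doubling the text doubles the raw count but leaves the ratio unchanged.
print(stopword_ratio(short_text))
print(stopword_ratio(long_text))
```

In the notebook, the same idea could be applied by passing `nltk_stopwords` as `stop_words` and storing the result in a new column.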


### Quantity of emotional words

In [43]:
emotional_words = [
    'ecstatic', 'elated', 'blissful', 'jubilant', 'cheerful', 'thrilled', 'delight', 'exhilarated',
    'content', 'grateful', 'affection', 'adoration', 'devotion', 'tenderness', 'caring', 'compassionate',
    'empathy', 'cherish', 'admire', 'fondness', 'hopeful', 'confident', 'optimistic', 'encouraged',
    'reassured', 'inspired', 'ambitious', 'enthusiastic', 'dreamy', 'faithful',
    'terrified', 'horrified', 'anxious', 'panicked', 'frightened', 'apprehensive', 'worried', 'nervous',
    'alarmed', 'dread', 'furious', 'outraged', 'enraged', 'livid', 'infuriated', 'annoyed', 'frustrated',
    'irritated', 'exasperated', 'agitated', 'heartbroken', 'devastated', 'mournful', 'sorrowful',
    'melancholy', 'dejected', 'despair', 'loneliness', 'hopeless', 'disheartened',
    'amazed', 'stunned', 'shocked', 'astounded', 'astonished', 'startled', 'dumbfounded', 'speechless',
    'perplexed', 'bewildered',
    'disgusted', 'repulsed', 'nauseated', 'revolted', 'loathing', 'detest', 'despise', 'scorn',
    'contemptuous', 'abhor',
    'ashamed', 'embarrassed', 'guilty', 'humiliated', 'regretful', 'remorseful', 'mortified', 
    'self-conscious', 'apologetic', 'disgraced'
]

In [44]:
def count_emotional_words(text):
    # Ensure text is a string and normalize: remove punctuation and convert to lowercase
    if not isinstance(text, str):
        return 0
    words = re.findall(r'\b\w+\b', text.lower())
    
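    # Count how many tokens appear in the emotional word list.
    # Note: hyphenated entries such as 'self-conscious' cannot match here,
    # because the tokenizer above splits words on hyphens.
    emotional_word_count = sum(1 for word in words if word in emotional_words)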
    
    return emotional_word_count

df['emotional'] = df['text'].apply(count_emotional_words)

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis,qtt_punt,stopwords,emotional
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88,113,307,0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71,77,253,0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2,75,47,202,0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3,75,51,156,1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34,92,128,353,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7,62,47,216,3
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5,24,42,136,1
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37,476,527,1873,0
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18,54,61,190,0


In [91]:
few = df[df['label']== 'Fake']['emotional'].mean()
print(few)

tew = df[df['label']== 'True']['emotional'].mean()
print(tew)

0.3467683524349197
0.24989494326936545
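
One detail worth checking is how the tokenizer treats hyphenated entries of `emotional_words` such as `'self-conscious'`. The sketch below compares the notebook's pattern with a hypothetical alternative, `tokenize_with_hyphens`, that keeps hyphen-joined words together. The sample sentence is made up, and switching tokenizers in the notebook would change the reported counts.

```python
import re

# Made-up sentence for illustration only.
sample = "He felt self-conscious and embarrassed."

# Tokenizer used in the notebook: hyphens act as word boundaries.
plain_tokens = re.findall(r'\b\w+\b', sample.lower())
print(plain_tokens)

# Hypothetical alternative: keep hyphen-joined words as a single token.
def tokenize_with_hyphens(text):
    return re.findall(r'\b\w+(?:-\w+)*\b', text.lower())

hyphen_tokens = tokenize_with_hyphens(sample)
print(hyphen_tokens)
```

With the first tokenizer the word arrives as `'self'` and `'conscious'`, so the list entry is never counted; the second keeps it whole.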


### Quantity of non-stopword words repeated more than 10 times

In [46]:
from collections import Counter

def count_repeated_non_stopwords(text, threshold=10):
    # Reuse the stopword set built earlier instead of reloading it on every call
    stop_words = nltk_stopwords
    
    # Normalize the text: remove punctuation and convert to lowercase
    words = re.findall(r'\b\w+\b', text.lower())
    
    # Filter out stopwords
    non_stopwords = [word for word in words if word not in stop_words]
    
    # Count the frequency of each non-stopword
    word_counts = Counter(non_stopwords)
    
    # Filter words repeated more than 'threshold' times
    repeated_words = [word for word, count in word_counts.items() if count > threshold]
    
    # Return the count of non-stopwords repeated more than 'threshold' times
    return len(repeated_words)

df['repeated'] = df['text'].apply(count_repeated_non_stopwords)

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis,qtt_punt,stopwords,emotional,repeated
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88,113,307,0,0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71,77,253,0,2
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2,75,47,202,0,1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3,75,51,156,1,0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34,92,128,353,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7,62,47,216,3,1
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5,24,42,136,1,0
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37,476,527,1873,0,19
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18,54,61,190,0,0


In [47]:
frnw = df[df['label']== 'Fake']['repeated'].mean()
print(frnw)

trnw = df[df['label']== 'True']['repeated'].mean()
print(trnw)

0.6609432917216991
0.5772517159266004


### Number of capital letters in the title

In [48]:
def count_capital_letters(text):
    # Count the number of uppercase letters in the text
    return sum(1 for char in text if char.isupper())


df['capital_title'] = df['title'].apply(count_capital_letters)

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis,qtt_punt,stopwords,emotional,repeated,capital_title
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88,113,307,0,0,4
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71,77,253,0,2,4
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2,75,47,202,0,1,7
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3,75,51,156,1,0,8
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34,92,128,353,0,3,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7,62,47,216,3,1,13
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5,24,42,136,1,0,15
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37,476,527,1873,0,19,12
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18,54,61,190,0,0,9


In [49]:
fc = df[df['label']== 'Fake']['capital_title'].mean()
print(fc)

tc = df[df['label']== 'True']['capital_title'].mean()
print(tc)

27.850965020663796
3.551991408694028
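
Because titles differ in length, a share of uppercase letters can complement the raw count. This is a minimal sketch under that assumption; `uppercase_ratio` is a hypothetical helper and is not used elsewhere in the notebook.

```python
def uppercase_ratio(title):
    """Return the fraction of alphabetic characters in `title` that are uppercase."""
    letters = [char for char in title if char.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for char in letters if char.isupper()) / len(letters)

# Short title fragments taken from the table above.
print(uppercase_ratio("FBI Russia probe"))
print(uppercase_ratio("JUSTICE? Yahoo Settles"))
```

It could be stored alongside `capital_title`, for example with `df['title'].apply(uppercase_ratio)`.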


### Number of references

In [50]:
def count_references(text):
    # Define common patterns for references:
    reference_patterns = [
        r'\[\d+\]',                          # Numeric references like [1], [23]
        r'\(\d+\)',                          # Numeric references like (1), (23)
        r'\([A-Za-z]+, \d{4}\)',             # Parenthetical academic references like (Smith, 2020)
        r'\b\d{4}\b',                        # Any standalone four-digit number (years like 2020, but also amounts like 5000)
        r'\bdoi:?\s?10\.\d{4,9}/[-._;()/:A-Z0-9]+\b',  # DOI references; as written, only uppercase suffixes match (e.g., doi:10.1234/ABCD.5678)
        r'(https?://[^\s]+)',                # URLs (e.g., http://example.com or https://doi.org)
        r'\bISBN[-\s]?(?:\d{9}[\dXx]|\d{13})\b',  # ISBN numbers (ISBN-10 or ISBN-13 format)
        r'\^[1-9]\d*',                       # Footnotes like ^1, ^2, etc.
        r'[A-Za-z]+ et al\., \d{4}',         # Common academic style, e.g., Smith et al., 2020
        r'\([A-Za-z]+ et al\., \d{4}\)'      # Full academic reference style, e.g., (Smith et al., 2020)
    ]
    
    # Combine all patterns into a single regex pattern
    combined_pattern = '|'.join(reference_patterns)
    
    # Find all matches in the text
    references = re.findall(combined_pattern, text)
    
    return len(references)

df['reference_count'] = df['text'].apply(count_references)

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis,qtt_punt,stopwords,emotional,repeated,capital_title,reference_count
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88,113,307,0,0,4,3
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71,77,253,0,2,4,2
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2,75,47,202,0,1,7,2
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3,75,51,156,1,0,8,2
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34,92,128,353,0,3,4,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7,62,47,216,3,1,13,1
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5,24,42,136,1,0,15,1
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37,476,527,1873,0,19,12,12
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18,54,61,190,0,0,9,6


In [51]:
fr = df[df['label']== 'Fake']['reference_count'].mean()
print(fr)

tr = df[df['label']== 'True']['reference_count'].mean()
print(tr)

1.5326147160325507
1.3391231264883037
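
Since `reference_count` sums matches across all patterns, it can help to see which pattern contributes what. The sketch below counts matches per pattern on a made-up sentence, using three of the notebook's patterns; the dictionary keys are hypothetical labels.

```python
import re

# A subset of the notebook's patterns, keyed by hypothetical labels.
patterns = {
    "bracket_number": r'\[\d+\]',
    "four_digits": r'\b\d{4}\b',
    "url": r'https?://[^\s]+',
}

# Made-up sentence for illustration only.
sample = "In 2016 the firm paid 5000 dollars [1], see http://example.com/report"

# Count matches separately for each pattern.
per_pattern = {name: len(re.findall(pattern, sample)) for name, pattern in patterns.items()}
print(per_pattern)
```

Here the four-digit pattern counts both the year and the amount, which suggests that in news text this feature may largely track dates and figures rather than citations.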


### Mean word length

In [52]:
def mean_word_length(text):
    # Normalize the text by removing punctuation and extracting words
    words = re.findall(r'\b\w+\b', text)
    
    # Calculate the lengths of the words
    word_lengths = [len(word) for word in words]
    
    # Calculate the mean (average) word length
    if word_lengths:  # Ensure there are words to avoid division by zero
        mean_length = sum(word_lengths) / len(word_lengths)
    else:
        mean_length = 0
    
    return mean_length


df['word_length'] = df['text'].apply(mean_word_length)

df

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis,qtt_punt,stopwords,emotional,repeated,capital_title,reference_count,word_length
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88,113,307,0,0,4,3,4.914921
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71,77,253,0,2,4,2,5.237500
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.750000,2,75,47,202,0,1,7,2,4.847639
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.000000,9.000000,3,75,51,156,1,0,8,2,5.197943
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.000000,34,92,128,353,0,3,4,4,4.752554
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44883,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,Fake,3237,61,27.894737,10.000000,7,62,47,216,3,1,13,1,4.914657
44884,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,Fake,1684,81,23.000000,12.000000,5,24,42,136,1,0,15,1,4.364821
44885,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,Fake,25065,85,29.836879,14.000000,37,476,527,1873,0,19,12,12,4.719261
44886,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,Fake,2685,67,21.272727,12.000000,18,54,61,190,0,0,9,6,4.495798


In [53]:
fmw = df[df['label']== 'Fake']['word_length'].mean()
print(fmw)

tmw = df[df['label']== 'True']['word_length'].mean()
print(tmw)

4.562553885608005
4.927742579875065


In [54]:
df.head()

Unnamed: 0,title,text,subject,label,text_length,title_length,length_phrases_no_punctuation_text,length_phrases_no_punctuation_title,qtt_numbers,qtt_noms_propis,qtt_punt,stopwords,emotional,repeated,capital_title,reference_count,word_length
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,True,4659,64,17.904762,3.666667,17,88,113,307,0,0,4,3,4.914921
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,True,4077,64,17.081081,3.333333,9,71,77,253,0,2,4,2,5.2375
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,True,2789,60,20.909091,2.75,2,75,47,202,0,1,7,2,4.847639
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,True,2461,59,18.0,9.0,3,75,51,156,1,0,8,2,5.197943
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,True,5204,69,14.982759,11.0,34,92,128,353,0,3,4,4,4.752554


In [55]:
df.to_csv('data/News_dataset/initial_features.csv', index = False)


## Task 10: Identify potential biases in the data.
Are there any potential biases in the dataset, such as overrepresentation of certain subjects, time periods, or types of news?
How might these biases affect the analysis and conclusions?
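
Two checks that may help answer these questions are sketched below: a cross-tabulation of `subject` against `label`, and the share of articles per label whose text contains the `(Reuters)` agency tag. The toy rows mirror subjects and text openings visible in the outputs of this notebook; the real check would run on the full dataset, and the results here say nothing about the overall distribution.

```python
import pandas as pd

# Toy rows mirroring subjects and text openings visible in the outputs above.
toy = pd.DataFrame({
    "subject": ["politicsNews", "politicsNews", "News", "Middle-east"],
    "label": ["True", "True", "Fake", "Fake"],
    "text": [
        "WASHINGTON (Reuters) - The head of a",
        "SEATTLE/WASHINGTON (Reuters) - President",
        "Donald Trump just couldn t wish all Americans",
        "21st Century Wire says Al Jazeera America will",
    ],
})

# Subjects that occur under only one label would leak the label to a model.
subject_by_label = pd.crosstab(toy["subject"], toy["label"])
print(subject_by_label)

# Share of articles per label whose text contains the agency tag.
has_reuters = toy["text"].str.contains("(Reuters)", regex=False)
reuters_share = has_reuters.groupby(toy["label"]).mean()
print(reuters_share)
```

If a subject or a source tag appears under only one label in the full data, a classifier could learn that shortcut instead of the writing style, which would inflate accuracy on this dataset without generalizing.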

### Close-up: Importance of Initial Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is a critical early step in a data science project. It familiarizes us with the dataset's structure and lets us spot anomalies, detect missing values, and examine the distribution of key features such as publication date, article length, and subject.

These findings inform the next steps: which features are most relevant, how variables correlate, and what could limit our ability to detect fake news. They also help refine the research questions that drive the rest of the analysis.

The primary outcome of the EDA is a set of clear research questions serving the main goal: **detecting whether a given news article is fake or real.** The EDA results indicate which features are informative and which need further processing or transformation, giving a roadmap that keeps the analysis data-driven and focused on the core problem.

## Task 11: Define Key Features for Further Analysis
- Based on the EDA, select and refine the most relevant features (e.g., article length, title length, subject, day of the week, and publication date).
- Decide on any necessary feature engineering steps, such as creating new features, encoding categorical variables, or scaling numerical features.
- Prepare the dataset for deeper analysis and modeling by addressing the research questions we’ve identified, which aim to solve the main goal: distinguishing between fake and real news.

This final task marks the transition from EDA to more targeted feature engineering and analysis, setting the stage for building models that can accurately detect fake news based on the refined features.
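
As one possible illustration of the encoding and scaling steps listed in this task, the sketch below one-hot encodes `subject` and z-scores two numeric columns using plain pandas. The three rows are copied from the tables above (rows 0, 44884 and 44886); the choice of z-score scaling and of `pd.get_dummies` is an assumption, not a decision made in the notebook.

```python
import pandas as pd

# Three rows copied from the tables above, to keep the sketch self-contained.
toy = pd.DataFrame({
    "subject": ["politicsNews", "Middle-east", "Middle-east"],
    "text_length": [4659, 1684, 2685],
    "title_length": [64, 81, 67],
})

numeric_cols = ["text_length", "title_length"]

# Z-score scaling: subtract the column mean, divide by the population std.
scaled = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std(ddof=0)

# One-hot encode the categorical column.
encoded = pd.get_dummies(toy["subject"], prefix="subject")

features = pd.concat([scaled, encoded], axis=1)
print(features)
```

When a model is trained later, the scaling statistics would normally be computed on the training split only and then reused for the test split.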