<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - INRS - Dataset preparation 2

## Importing the required libraries

In [1]:
import re
import pandas as pd

## Setting input and output filenames

Set `input_filename` to the name of the file to be processed. The output filename is derived from it by inserting `suffix` before the file extension.

In [2]:
input_filename = 'news_cleaned_fake.jsonl'
suffix = '_prep'

def add_suffix(filename):
    # Extract the base filename without the extension
    base_filename = re.match(r'^([A-Za-z0-9-_,\s]+)\.[A-Za-z]{1,5}$', filename).group(1)
    
    # Append suffix to the base filename
    new_filename = f'{base_filename}{suffix}'
    
    # Add the original file extension back
    new_filename += re.search(r'\.[A-Za-z]{1,5}$', filename).group()
    
    return new_filename

output_filename = add_suffix(input_filename)
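
The regular expression in `add_suffix` only matches bare filenames made of letters, digits, hyphens, underscores, commas and whitespace. A filename that includes a directory or an extra dot would make `re.match` return `None`, and the call to `.group(1)` would then raise an `AttributeError`. A minimal alternative sketch, assuming the standard-library `pathlib` module is acceptable here (the function name `add_suffix_pathlib` is hypothetical and not part of the original notebook):

```python
from pathlib import Path

def add_suffix_pathlib(filename, suffix='_prep'):
    """Insert `suffix` between the base name and the final extension."""
    path = Path(filename)
    # path.stem is the name without its last extension; path.suffix is that extension
    return str(path.with_name(f'{path.stem}{suffix}{path.suffix}'))

print(add_suffix_pathlib('news_cleaned_fake.jsonl'))
```

Passing `suffix` as a parameter also removes the dependency on a global variable, which makes the helper easier to reuse in other notebooks.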

## Data wrangling

### Importing the raw data into a dataframe

In [3]:
df_tweets_raw_data = pd.read_json(input_filename, lines=True)

In [4]:
df_tweets_raw_data.head(5)

Unnamed: 0.1,Unnamed: 0,id,domain,type,url,content,scraped_at,inserted_at,updated_at,title,authors,keywords,meta_keywords,meta_description,tags,summary,source
0,27,34,beforeitsnews.com,fake,http://beforeitsnews.com/opinion-conservative/...,Headline: Bitcoin & Blockchain Searches Exceed...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Surprise: Socialist Hotbed Of Venezuela Has Lo...,The Pirate'S Cove,,[''],,,,
1,28,35,beforeitsnews.com,fake,http://beforeitsnews.com/politics/2018/01/wate...,Water Cooler 1/25/18 Open Thread; Fake News ? ...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Water Cooler 1/25/18 Open Thread; Fake News ? ...,,,[''],,,,
2,29,36,beforeitsnews.com,fake,http://beforeitsnews.com/politics/2018/01/vete...,Veteran Commentator Calls Out the Growing “Eth...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Veteran Commentator Calls Out the Growing “Eth...,,,[''],,,,
3,30,37,beforeitsnews.com,fake,http://beforeitsnews.com/arts/2018/01/lost-wor...,"Lost Words, Hidden Words, Otters, Banks and Bo...",2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,"Lost Words, Hidden Words, Otters, Banks and Books",Jackie Morris Artist,,[''],,,,
4,31,38,beforeitsnews.com,fake,http://beforeitsnews.com/financial-markets/201...,Red Alert: Bond Yields Are SCREAMING “Inflatio...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Red Alert: Bond Yields Are SCREAMING “Inflatio...,Phoenix Capital Research,,[''],,,,


In [5]:
df_tweets_raw_data.shape

(894746, 17)
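
With close to 900,000 rows, loading the whole file in one call can be memory-hungry on smaller machines. A minimal sketch, assuming memory is a constraint, of reading a JSON Lines file in chunks through the `chunksize` parameter of `pd.read_json`. The demo file, its contents and the chunk size are made up for illustration:

```python
import os
import tempfile
import pandas as pd

# Build a tiny JSON Lines file so the sketch is self-contained (illustrative data)
demo = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'content': list('abcde')})
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl')
demo.to_json(demo_path, orient='records', lines=True)

# With lines=True and chunksize set, read_json returns an iterator of dataframes
total_rows = 0
with pd.read_json(demo_path, lines=True, chunksize=2) as reader:
    for chunk in reader:
        total_rows += len(chunk)

print(total_rows)
```

Each chunk is an ordinary dataframe, so per-chunk filtering or column selection can be applied before the pieces are combined.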

### Checking if data types are consistent

In [6]:
df_tweets_raw_data.dtypes

Unnamed: 0                   int64
id                           int64
domain                      object
type                        object
url                         object
content                     object
scraped_at          datetime64[ns]
inserted_at         datetime64[ns]
updated_at          datetime64[ns]
title                       object
authors                     object
keywords                   float64
meta_keywords               object
meta_description            object
tags                        object
summary                    float64
source                     float64
dtype: object

#### Converting the `id` column to `str` for future use

In [7]:
df_tweets_raw_data['id'] = df_tweets_raw_data['id'].astype(str)

### Fixing missing values in the column `authors`

In [8]:
print(df_tweets_raw_data['authors'].isnull().sum())

349528


In [9]:
df_tweets_raw_data['authors'].unique()

array(["The Pirate'S Cove", None, 'Jackie Morris Artist', ...,
       'Tracy Mitchell, Jim Asherman, Lillian Geiger Smith, Anthony Melé',
       'Joe W., Cm Sackett, David Robertson, Cathleen James, Sam R, Jason Rennie, Larry Gibby',
       'Robert Rivera, Michael Rodriguez, Haha, Rock Hillbilly, Patriot, Bill Mcmicheals, Press Watchusa, Marlene Hessler, John Bohler, A. P.'],
      dtype=object)

In [10]:
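# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df_tweets_raw_data['authors'] = df_tweets_raw_data['authors'].fillna('None')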

In [11]:
print(df_tweets_raw_data['authors'].isnull().sum())

0


### Checking the columns `id`, `url` and `scraped_at` for missing values

In [12]:
print(df_tweets_raw_data['id'].isnull().sum())

0


In [13]:
print(df_tweets_raw_data['url'].isnull().sum())

0


In [14]:
print(df_tweets_raw_data['scraped_at'].isnull().sum())

0
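
The three checks above can also be run in one step by selecting the columns of interest before counting nulls. A minimal sketch on a small made-up dataframe (`df_demo` and its values are illustrative; in the notebook the same expression would be applied to `df_tweets_raw_data`):

```python
import pandas as pd

# Illustrative frame with one missing url and one missing timestamp
df_demo = pd.DataFrame({
    'id': ['34', '35', '36'],
    'url': ['http://a.example', None, 'http://c.example'],
    'scraped_at': pd.to_datetime(['2018-01-25', '2018-01-25', None]),
})

# One Series with the number of missing values per selected column
missing_counts = df_demo[['id', 'url', 'scraped_at']].isnull().sum()
print(missing_counts)
```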


### Extracting the column `created_at`

In [15]:
# Extract the 'scraped_at' attribute
created_at = df_tweets_raw_data['scraped_at']

# Create a new column 'created_at'
df_tweets_raw_data['created_at'] = created_at

### Extracting the column `author_id`

In [16]:
# Extract the 'id' attribute
author_id = df_tweets_raw_data['id']

# Create a new column 'author_id'
df_tweets_raw_data['author_id'] = author_id

### Extracting the column `username`

In [17]:
# Extract the 'authors' attribute
username = df_tweets_raw_data['authors']

# Create a new column 'username'
df_tweets_raw_data['username'] = username

### Extracting the column `tweet_url`

In [18]:
# Extract the 'url' attribute
tweet_url = df_tweets_raw_data['url']

# Create a new column 'tweet_url'
df_tweets_raw_data['tweet_url'] = tweet_url

### Extracting the column `text`

In [19]:
# Extract the 'content' attribute
text = df_tweets_raw_data['content']

# Create a new column 'text'
df_tweets_raw_data['text'] = text
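
The five steps above all follow the same pattern: an existing column is copied under a new name. A minimal sketch of the same mapping expressed as a single dictionary and loop, shown on a one-row made-up dataframe (`df_demo`, `column_map` and the example values are illustrative names and data, not part of the original notebook):

```python
import pandas as pd

# Illustrative one-row frame with the source columns used in the notebook
df_demo = pd.DataFrame({
    'scraped_at': pd.to_datetime(['2018-01-25 16:17:44']),
    'id': ['34'],
    'authors': ["The Pirate'S Cove"],
    'url': ['http://example.com/article'],
    'content': ['Example article body'],
})

# Source column -> new column, mirroring the five steps above
column_map = {
    'scraped_at': 'created_at',
    'id': 'author_id',
    'authors': 'username',
    'url': 'tweet_url',
    'content': 'text',
}

# Copy each source column under its new name, keeping the originals
for source_col, target_col in column_map.items():
    df_demo[target_col] = df_demo[source_col]

print(df_demo.columns.tolist())
```

Keeping the mapping in one place makes it easier to see at a glance which source field feeds each output field.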

In [20]:
df_tweets_raw_data

Unnamed: 0.1,Unnamed: 0,id,domain,type,url,content,scraped_at,inserted_at,updated_at,title,...,meta_keywords,meta_description,tags,summary,source,created_at,author_id,username,tweet_url,text
0,27,34,beforeitsnews.com,fake,http://beforeitsnews.com/opinion-conservative/...,Headline: Bitcoin & Blockchain Searches Exceed...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Surprise: Socialist Hotbed Of Venezuela Has Lo...,...,[''],,,,,2018-01-25 16:17:44.789555,34,The Pirate'S Cove,http://beforeitsnews.com/opinion-conservative/...,Headline: Bitcoin & Blockchain Searches Exceed...
1,28,35,beforeitsnews.com,fake,http://beforeitsnews.com/politics/2018/01/wate...,Water Cooler 1/25/18 Open Thread; Fake News ? ...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Water Cooler 1/25/18 Open Thread; Fake News ? ...,...,[''],,,,,2018-01-25 16:17:44.789555,35,,http://beforeitsnews.com/politics/2018/01/wate...,Water Cooler 1/25/18 Open Thread; Fake News ? ...
2,29,36,beforeitsnews.com,fake,http://beforeitsnews.com/politics/2018/01/vete...,Veteran Commentator Calls Out the Growing “Eth...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Veteran Commentator Calls Out the Growing “Eth...,...,[''],,,,,2018-01-25 16:17:44.789555,36,,http://beforeitsnews.com/politics/2018/01/vete...,Veteran Commentator Calls Out the Growing “Eth...
3,30,37,beforeitsnews.com,fake,http://beforeitsnews.com/arts/2018/01/lost-wor...,"Lost Words, Hidden Words, Otters, Banks and Bo...",2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,"Lost Words, Hidden Words, Otters, Banks and Books",...,[''],,,,,2018-01-25 16:17:44.789555,37,Jackie Morris Artist,http://beforeitsnews.com/arts/2018/01/lost-wor...,"Lost Words, Hidden Words, Otters, Banks and Bo..."
4,31,38,beforeitsnews.com,fake,http://beforeitsnews.com/financial-markets/201...,Red Alert: Bond Yields Are SCREAMING “Inflatio...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Red Alert: Bond Yields Are SCREAMING “Inflatio...,...,[''],,,,,2018-01-25 16:17:44.789555,38,Phoenix Capital Research,http://beforeitsnews.com/financial-markets/201...,Red Alert: Bond Yields Are SCREAMING “Inflatio...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
894741,9045,7981331,theinternetpost.net,fake,https://theinternetpost.net/2017/12/13/nra-sil...,"The National Rifle Association (NRA), which fa...",2018-01-01 15:13:06.815222,2018-02-08 19:18:34.468038,2018-02-08 19:18:34.468066,"NRA Silent as States, Feds Ban Gun Sales to Me...",...,[''],"The National Rifle Association (NRA), which fa...","NRA, gun sales, Medical Marijuana Patients",,,2018-01-01 15:13:06.815222,7981331,,https://theinternetpost.net/2017/12/13/nra-sil...,"The National Rifle Association (NRA), which fa..."
894742,9053,7981340,theinternetpost.net,fake,https://theinternetpost.net/category/economics/,‘Florida’s 3rd District Court of Appeal just c...,2018-01-01 15:13:06.815222,2018-02-08 19:18:34.468038,2018-02-08 19:18:34.468066,THE INTERNET POST,...,[''],"Posts about economics written by ajfloyd, bjja...","US, XtendiMax, Opium poppy, national debt, Hom...",,,2018-01-01 15:13:06.815222,7981340,,https://theinternetpost.net/category/economics/,‘Florida’s 3rd District Court of Appeal just c...
894743,9062,7981349,theinternetpost.net,fake,https://theinternetpost.net/tag/vaccines/,The Institute for Justice estimates one out of...,2018-01-01 15:13:06.815222,2018-02-08 19:18:34.468038,2018-02-08 19:18:34.468066,THE INTERNET POST,...,[''],"Posts about vaccines written by kristalklear, ...","Meningitis, Michigan appeals court, Michigan, ...",,,2018-01-01 15:13:06.815222,7981349,,https://theinternetpost.net/tag/vaccines/,The Institute for Justice estimates one out of...
894744,9937,7982299,therightscoop.com,fake,http://therightscoop.com/rush-reads-the-obama-...,This is an instant classic. I call it the “Oba...,2018-01-01 15:13:06.815222,2018-02-08 19:18:34.468038,2018-02-08 19:18:34.468066,Rush reads the “Obama Blames…” diaries,...,[''],,,,,2018-01-01 15:13:06.815222,7982299,"Joe W., Cm Sackett, David Robertson, Cathleen ...",http://therightscoop.com/rush-reads-the-obama-...,This is an instant classic. I call it the “Oba...


### Inspecting the data

In [21]:
inspected_row = 3
print('username:' + df_tweets_raw_data.loc[inspected_row, 'username'])
print('content:' + df_tweets_raw_data.loc[inspected_row, 'content'])
print('tweet_url:' + df_tweets_raw_data.loc[inspected_row, 'tweet_url'])

username:Jackie Morris Artist
content:Lost Words, Hidden Words, Otters, Banks and Books

% of readers think this story is Fact. Add your two cents.

Headline: Bitcoin & Blockchain Searches Exceed Trump! Blockchain Stocks Are Next!

Let me tell you something, about otters and money, books and banks.

Wonderful news today as Jane Beaton’s crowd funding initiative gets an extra couple of weeks to raise it’s target. And because in the process of this learning curve Penguin Books came on board in a massive way that target has been massively reduced. Half way there. Hoping to push through to funding to see a copy of The Lost Words brought to every school in Scotland.

To celebrate I want to auction this absolutely unique proof of the silk otter scarf, produced by Beckford Silk for Compton Verney. (It’s printed on paper, not silk. There may still be scarves for sale at Compton Verney. You could wrap yourself in otters.)

A post shared by Jackie Morris (@jackiemorrisartist) on Jan 25, 2018 at 

### Creating the output file

In [22]:
df_tweets_raw_data[['created_at', 'author_id', 'username', 'tweet_url', 'text']].to_json(output_filename, orient='records', lines=True)
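
For `orient='records'`, `to_json` writes datetime columns as epoch milliseconds by default, and `read_json` may re-infer digit-only strings such as `author_id` as integers when the file is read back. A minimal sketch, assuming human-readable timestamps and string identifiers are wanted downstream, of writing with `date_format='iso'` and reading back with an explicit `dtype`. The demo dataframe and temporary path are made up for illustration:

```python
import os
import tempfile
import pandas as pd

# Illustrative frame with the same output columns as the notebook
df_demo = pd.DataFrame({
    'created_at': pd.to_datetime(['2018-01-25 16:17:44', '2018-01-01 15:13:06']),
    'author_id': ['34', '7981331'],
    'username': ["The Pirate'S Cove", 'None'],
    'tweet_url': ['http://example.com/a', 'http://example.com/b'],
    'text': ['First example text', 'Second example text'],
})

out_path = os.path.join(tempfile.mkdtemp(), 'demo_prep.jsonl')

# date_format='iso' writes ISO 8601 strings instead of epoch milliseconds
df_demo.to_json(out_path, orient='records', lines=True, date_format='iso')

# Look at the first raw line to see how the timestamp was serialised
with open(out_path, encoding='utf-8') as handle:
    first_line = handle.readline()

# Read the file back, forcing author_id to stay a string
df_back = pd.read_json(out_path, lines=True, dtype={'author_id': str})
print(df_back.shape)
```

Comparing `df_back.shape` with the shape of the exported columns is a quick sanity check that no record was lost when writing the file.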