# Most Common Words per Classification in news.csv

The file `news.csv` contains news articles collected from the internet, each labeled with a classification based on its content. There are five classifications: economy, sports, science, culture and entertainment.

We create a function that reads the CSV file and prints, for each of the five classifications, a list of its x most frequent words. The function takes two parameters: the name of the file to read and the number of words to show.

We use pandas, NLTK, gensim and collections.

In [1]:
!pip install gensim



In [2]:
# Import libraries

import pandas as pd
import nltk
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from collections import Counter

In [3]:
# Download stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nykolai/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# Obtain additional stopwords from nltk

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

## Function

In [5]:
# Remove stopwords and words with 3 or fewer characters

def preprocess(text):
    result = []
    # simple_preprocess lowercases the text and splits it into tokens
    for token in simple_preprocess(text):
        if token not in STOPWORDS and token not in stop_words and len(token) > 3:
            result.append(token)
    return result
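
To make the effect of the filter concrete, here is a minimal sketch that needs neither gensim nor NLTK. It is not the notebook's function: the regular expression is only a rough stand-in for `simple_preprocess`, and `DEMO_STOPWORDS` and `preprocess_sketch` are hypothetical names introduced for illustration. It shows that the `len(token) > 3` test drops three-letter words such as "war" even when they are not stopwords.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook
# combines gensim's STOPWORDS with NLTK's English stopword list.
DEMO_STOPWORDS = {"the", "and", "now", "in", "have", "how", "with"}

def preprocess_sketch(text):
    # Rough stand-in for simple_preprocess: lowercase, keep runs of letters
    tokens = re.findall(r"[a-z]+", text.lower())
    # Same filter as preprocess: drop stopwords and tokens of 3 or fewer characters
    return [t for t in tokens if t not in DEMO_STOPWORDS and len(t) > 3]

sample = "The pandemic and now the war in Ukraine have altered the economy"
print(preprocess_sketch(sample))
# ['pandemic', 'ukraine', 'altered', 'economy']
```

Here "have" is removed as a stopword, while "war" is removed only because of its length.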

In [6]:
# Identify the n most common words in a list

def count(words, n):
    # The parameter is named 'words' to avoid shadowing the built-in 'list'.
    # most_common returns (word, frequency) pairs; keep only the words.
    return [word for word, _ in Counter(words).most_common(n)]
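
As a small illustration of what `count` relies on, the sketch below runs `Counter.most_common` on a hand-made token list (the tokens are made up for this example, not taken from `news.csv`). It shows the `(word, frequency)` pairs that `most_common` returns and how the words are extracted from them.

```python
from collections import Counter

# Made-up tokens for illustration
tokens = ["race", "barty", "race", "time", "barty", "race"]

counts = Counter(tokens)
pairs = counts.most_common(2)
print(pairs)
# [('race', 3), ('barty', 2)]

# Keep only the words, as count() does
top_words = [word for word, _ in pairs]
print(top_words)
# ['race', 'barty']
```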

In [7]:
# Read the CSV file and print the x most common words per classification

def most_common_words(file, x):
    
    # Read file into df dataframe
    df = pd.read_csv(file)
    df['clean'] = df['content'].apply(preprocess) # Apply the preprocess function to the dataframe
    
    # Build one list per classification holding all of its cleaned words
    # (calling .sum() on a column of lists concatenates them)
    sports = df.loc[df['classification'] == 'sports']['clean'].sum()
    economy = df.loc[df['classification'] == 'economy']['clean'].sum()
    entertainment = df.loc[df['classification'] == 'entertainment']['clean'].sum()
    culture = df.loc[df['classification'] == 'culture']['clean'].sum()
    science = df.loc[df['classification'] == 'science']['clean'].sum()
    
    # Show results
    # Use the 'file' argument in the message instead of a hardcoded file name
    print('The ' + str(x) + ' most common words per classification in the file ' + file + ' are:' +
          '\n\nSports: ' + str(count(sports, x)) +
          '\nEntertainment: ' + str(count(entertainment, x)) +
          '\nCulture: ' + str(count(culture, x)) +
          '\nScience: ' + str(count(science, x)) +
          '\nEconomy: ' + str(count(economy, x)))
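
The per-classification aggregation can also be sketched without naming each category by hand. The example below is an alternative to the notebook's approach, not a replacement for it: it uses only the standard library, and `rows` and `top_words_by_class` are hypothetical names with made-up data standing in for the `classification` and `clean` columns.

```python
from collections import Counter, defaultdict

# Hypothetical (classification, cleaned tokens) pairs for illustration
rows = [
    ("sports", ["liverpool", "salah", "liverpool"]),
    ("science", ["forests", "carbon"]),
    ("sports", ["barty", "liverpool", "salah"]),
    ("science", ["forests", "water", "forests", "carbon"]),
]

def top_words_by_class(rows, n):
    # One Counter per classification, created on first use
    counters = defaultdict(Counter)
    for label, tokens in rows:
        counters[label].update(tokens)
    # Keep only the n most frequent words for each classification
    return {label: [word for word, _ in counter.most_common(n)]
            for label, counter in counters.items()}

result = top_words_by_class(rows, 2)
print(result)
# {'sports': ['liverpool', 'salah'], 'science': ['forests', 'carbon']}
```

Updating a counter row by row avoids building one long list per classification, and any classification that appears in the data is picked up automatically.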
    

In [8]:
# Call the function
most_common_words('news.csv', 8) # The number of words shown can be changed

The 8 most common words per classification in the file news.csv are:

Sports: ['race', 'barty', 'time', 'liverpool', 'year', 'said', 'weekend', 'world']
Entertainment: ['west', 'kanye', 'time', 'sophie', 'instagram', 'said', 'couple', 'madsen']
Culture: ['said', 'hall', 'black', 'african', 'american', 'history', 'dallas', 'dance']
Science: ['forests', 'says', 'water', 'spinosaurus', 'cooling', 'like', 'global', 'carbon']
Economy: ['said', 'growth', 'china', 'year', 'russia', 'investment', 'economy', 'russian']


### Appendix: Step-by-step with examples

In [9]:
df_ex = pd.read_csv('news.csv')

In [10]:
# View original file
df_ex

Unnamed: 0,title,content,classification,webPage
0,Is America’s Economy Entering a New Normal?,The pandemic and now the war in Ukraine have a...,economy,https://www.nytimes.com/2022/03/24/business/ec...
1,Mohamed Salah: New Liverpool deal likely to ha...,"Mohamed Salah's options would be going ""sidewa...",sports,https://www.bbc.com/sport/football/60856636
2,NASA’s exoplanet count surges past 5000,It’s official: The number of planets known bey...,science,https://www.sciencenews.org/article/exoplanet-...
3,Moscow insiders describe panic frustration and...,Yuriy Shatalov (not his real name) an equity t...,economy,https://www.businessinsider.com/russian-trader...
4,Five 4000 Year Old Painted Tombs Discovered in...,Five painted tombs were recently unearthed in ...,culture,https://www.artnews.com/art-news/news/saqqara-...
5,Julia Fox Backtracks After Calling Kanye West’...,She didn’t have all the facts. Julia Fox is ta...,entertainment,https://www.usmagazine.com/celebrity-news/news...
6,Gifted and at the top of her game -- Ashleigh ...,(CNN)On her day Ashleigh Barty was an unstoppa...,sports,https://edition.cnn.com/2022/03/23/tennis/ashl...
7,Spinosaurus’ dense bones fuel debate over whet...,A fierce group of predatory dinosaurs may have...,science,https://www.sciencenews.org/article/spinosauru...
8,Why did this cultural mecca for Dallas' Black ...,"""Dallas Rising: The Hall of Negro Life"" airs F...",culture,https://www.keranews.org/arts-culture/2022-03-...
9,Forests help reduce global warming in more way...,When it comes to cooling the planet forests ha...,science,https://www.sciencenews.org/article/forest-tre...


In [11]:
# Show example of news before removing stopwords
df_ex['content'][0]

'The pandemic and now the war in Ukraine have altered how America’s economy functions. While economists have spent months waiting for conditions to return to normal they are beginning to wonder what “normal” will mean. Some of the changes are noticeable in everyday life: Work from home is more popular burrito bowls and road trips cost more and buying a car or a couch made overseas is harder. But those are all symptoms of broader changes sweeping the economy — ones that could be a big deal for consumers businesses and policymakers alike if they linger. Consumer demand has been hot for months now workers are desperately wanted wages are climbing at a rapid clip and prices are rising at the fastest pace in four decades as vigorous buying clashes with roiled supply chains. Interest rates are expected to rise higher than they ever did in the 2010s as the Federal Reserve tries to rein in inflation. History is full of big moments that have changed America’s economic trajectory: The Great Depr

In [12]:
# Apply the preprocess function to the dataframe
df_ex['clean'] = df_ex['content'].apply(preprocess)

In [13]:
# Show example of cleaned up news after removing stopwords
print(df_ex['clean'][0])

['pandemic', 'ukraine', 'altered', 'america', 'economy', 'functions', 'economists', 'spent', 'months', 'waiting', 'conditions', 'return', 'normal', 'beginning', 'wonder', 'normal', 'mean', 'changes', 'noticeable', 'everyday', 'life', 'work', 'home', 'popular', 'burrito', 'bowls', 'road', 'trips', 'cost', 'buying', 'couch', 'overseas', 'harder', 'symptoms', 'broader', 'changes', 'sweeping', 'economy', 'ones', 'deal', 'consumers', 'businesses', 'policymakers', 'alike', 'linger', 'consumer', 'demand', 'months', 'workers', 'desperately', 'wanted', 'wages', 'climbing', 'rapid', 'clip', 'prices', 'rising', 'fastest', 'pace', 'decades', 'vigorous', 'buying', 'clashes', 'roiled', 'supply', 'chains', 'rates', 'expected', 'rise', 'higher', 'federal', 'reserve', 'tries', 'rein', 'inflation', 'history', 'moments', 'changed', 'america', 'economic', 'trajectory', 'great', 'depression', 'great', 'inflation', 'great', 'recession', 'examples', 'early', 'know', 'sure', 'changes', 'happening', 'today', '

In [14]:
# Show dataframe with 'clean' column
df_ex

Unnamed: 0,title,content,classification,webPage,clean
0,Is America’s Economy Entering a New Normal?,The pandemic and now the war in Ukraine have a...,economy,https://www.nytimes.com/2022/03/24/business/ec...,"[pandemic, ukraine, altered, america, economy,..."
1,Mohamed Salah: New Liverpool deal likely to ha...,"Mohamed Salah's options would be going ""sidewa...",sports,https://www.bbc.com/sport/football/60856636,"[mohamed, salah, options, going, sideways, man..."
2,NASA’s exoplanet count surges past 5000,It’s official: The number of planets known bey...,science,https://www.sciencenews.org/article/exoplanet-...,"[official, number, planets, known, solar, pass..."
3,Moscow insiders describe panic frustration and...,Yuriy Shatalov (not his real name) an equity t...,economy,https://www.businessinsider.com/russian-trader...,"[yuriy, shatalov, real, equity, trader, moscow..."
4,Five 4000 Year Old Painted Tombs Discovered in...,Five painted tombs were recently unearthed in ...,culture,https://www.artnews.com/art-news/news/saqqara-...,"[painted, tombs, recently, unearthed, saqqara,..."
5,Julia Fox Backtracks After Calling Kanye West’...,She didn’t have all the facts. Julia Fox is ta...,entertainment,https://www.usmagazine.com/celebrity-news/news...,"[facts, julia, taking, claims, kanye, west, ha..."
6,Gifted and at the top of her game -- Ashleigh ...,(CNN)On her day Ashleigh Barty was an unstoppa...,sports,https://edition.cnn.com/2022/03/23/tennis/ashl...,"[ashleigh, barty, unstoppable, force, sudden, ..."
7,Spinosaurus’ dense bones fuel debate over whet...,A fierce group of predatory dinosaurs may have...,science,https://www.sciencenews.org/article/spinosauru...,"[fierce, group, predatory, dinosaurs, hunting,..."
8,Why did this cultural mecca for Dallas' Black ...,"""Dallas Rising: The Hall of Negro Life"" airs F...",culture,https://www.keranews.org/arts-culture/2022-03-...,"[dallas, rising, hall, negro, life, airs, frid..."
9,Forests help reduce global warming in more way...,When it comes to cooling the planet forests ha...,science,https://www.sciencenews.org/article/forest-tre...,"[comes, cooling, planet, forests, trick, trees..."


In [15]:
# Add all words from sports category into a 'sports' list
sports = df_ex.loc[df_ex['classification'] == 'sports']['clean'].sum()

In [16]:
# View the 'sports' list before counting its most common words
print(sports)

['mohamed', 'salah', 'options', 'going', 'sideways', 'manchester', 'city', 'left', 'liverpool', 'says', 'reds', 'striker', 'michael', 'owen', 'year', 'egypt', 'forward', 'deal', 'runs', 'summer', 'previously', 'said', 'wanted', 'stay', 'future', 'liverpool', 'hands', 'reds', 'manager', 'jurgen', 'klopp', 'said', 'deal', 'decision', 'surprised', 'salah', 'sign', 'liverpool', 'options', 'moment', 'owen', 'said', 'going', 'sideways', 'manchester', 'city', 'good', 'teams', 'world', 'moment', 'time', 'things', 'happen', 'overnight', 'talking', 'biggest', 'clubs', 'world', 'biggest', 'players', 'world', 'player', 'sign', 'biggest', 'contract', 'life', 'think', 'entirely', 'normal', 'negotiations', 'drag', 'little', 'owen', 'told', 'sport', 'salah', 'joined', 'liverpool', 'italian', 'club', 'roma', 'summer', 'scored', 'goals', 'appearances', 'helped', 'club', 'premier', 'league', 'champions', 'league', 'league', 'fifa', 'club', 'world', 'uefa', 'super', 'time', 'anfield', 'owen', 'scored', 'g

In [17]:
# Apply the count function to the 'sports' list
count(sports, 5)

['race', 'barty', 'time', 'liverpool', 'year']