<a href="https://colab.research.google.com/github/jjc16/Project_Notebooks/blob/master/NLP_Data_Cleaning_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP on Technology Articles

## Project Background

In this project, a client asked me to identify the "top 10" keywords from a group of articles from a technology magazine. The client gave me freedom to define the problem in a way that made sense. The results of the exercise are below.

## Imports:
First, I import and install the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
import nltk
# nltk.download('all')  # uncomment on first run to fetch the NLTK corpora (only 'stopwords' is used below)

## Load data:

I now load the data into a pandas DataFrame.

In [2]:
pth = '/content/drive/MyDrive/Moodys_Assignment/assignment/News.csv'
data = pd.read_csv(pth)

## Data Exploration: 

In [3]:
data.head()

Unnamed: 0,news_id,title,content,link,date
0,1,Do chatbots really help you stay productive?,GUEST: When Slack burst onto the workplace sce...,http://venturebeat.com/?p=2141494,2017-01-01
1,2,Spanish social advertising company Adsmurai ra...,Barcelona-based social advertising company Ads...,http://venturebeat.com/?p=2141069,2017-01-01
2,3,HTC: No Vive 2 at CES,I\u2019d wager most people who bought the HTC ...,http://venturebeat.com/?p=2141559,2017-01-01
3,4,Chinese firms reportedly ordered to pay Disney...,(Reuters) &#8212;\xa0A Shanghai court ordered ...,http://venturebeat.com/?p=2141698,2017-01-01
4,5,AWS sees growth in database migrations,Public cloud market leader Amazon Web Services...,http://venturebeat.com/?p=2141375,2017-01-01


We can also specifically print out several examples of the 'content' column, which contains the articles:

In [4]:
# Print the first six articles, separated by blank lines
for text in data.content.head(6):
  print(text + '\n')

GUEST: When Slack burst onto the workplace scene, employees rejoiced. Finally, there was a way to chat with one another without having to send a dreaded email or worse, get up and actually go chat with your coworker face-to-face. Thanks to Slack and a handful of other messaging platforms, businesses could easily communicate across teams using&#160;[&#8230;]\n

Barcelona-based social advertising company Adsmurai has received \u20ac4 million ($4.2 million) in a second round of funding led by venture capital firm Axon Partners Group, with participation from Banc Sabadell, through its program BStartup10, and Enisa, a Spanish government-funded financing group. Launched in 2014 by Marc Elena, Otto W\xfcst and Juan Antonio Robles, Adsmurai specializes&#160;[&#8230;]\n

I\u2019d wager most people who bought the HTC Vive love the unit but wish a new version would bring key improvements. A slimmer design and lighter cord, a better fit\xa0for\xa0the face and more ergonomic controllers without har

As can be seen, the text in its native format is messy: it contains punctuation and several artifacts of the data collection process, such as HTML entities and escaped characters. It also contains many common English words that are of little help in determining keywords and need to be removed. I will write and use several helper functions to clean the data.

In [46]:
# Clean bad words: drop any token that contains one of these characters
bad_words = ['&', '#', '\\']
def clean_bad_words(lst, bad_words=('&', '#', '\\')):
  out = []
  for word in lst:
    # Keep the word only if none of its characters is a "bad" character
    if not any(ch in bad_words for ch in word):
      out.append(word)
  return out
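
`clean_bad_words` discards a whole token as soon as it contains `&`, `#`, or a backslash. As a possible alternative, the HTML entities visible in the sample output (`&#160;`, `&#8230;`) could be decoded before tokenizing. This is a minimal sketch, assuming the artifacts are standard HTML character references; `decode_entities` is a hypothetical helper and is not part of the original notebook.

```python
import html

def decode_entities(text):
    """Decode HTML character references, then normalise non-breaking spaces.

    Hypothetical helper, not part of the original notebook.
    """
    decoded = html.unescape(text)
    # '&#160;' decodes to a non-breaking space (U+00A0); map it to a plain space
    return decoded.replace('\xa0', ' ')

# Sample fragment taken from the first article printed above
sample = 'communicate across teams using&#160;[&#8230;]'
print(decode_entities(sample))
```

With this approach the word next to the entity survives, at the cost of one more pass over each article.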

In [47]:
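# Strip punctuation and symbols from both ends of each word.
# Note: str.strip treats `symbols` as a set of characters, not as a regex.
def strip_symbols(lst, symbols='[()$.:,;]'):
  return [word.strip(symbols) for word in lst]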
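
Because the default `symbols` string looks like a regex character class, it is worth checking what `str.strip` does with it. The sketch below uses made-up tokens (an assumption, not rows from the dataset) to show that only leading and trailing characters are removed, and that a token made up entirely of symbols becomes an empty string.

```python
symbols = '[()$.:,;]'

# str.strip treats its argument as a set of characters, not a regex,
# so '[' and ']' are stripped as literal characters too.
examples = ['(reuters)', 'u.s.', '$4.2', '[...]', 'face-to-face.']
stripped = [token.strip(symbols) for token in examples]
print(stripped)
# → ['reuters', 'u.s', '4.2', '', 'face-to-face']
```

Interior symbols such as the first dot in `u.s.` are left alone, which is why a token like `u.s` can appear in the final counts.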

In [9]:
# Generate the list of common words
from nltk.corpus import stopwords

# Read one common word per line; the context manager closes the file automatically
with open("/content/drive/MyDrive/Moodys_Assignment/assignment/common_words.txt", "r") as my_file:
  content = my_file.read()
common_words = content.split("\n")
print(common_words)

# Extend the list with NLTK's English stop words
stop_words = stopwords.words('english')
common_words += stop_words

['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it', 'you', 'that', 'he', 'was', 'for', 'on', 'are', 'with', 'as', 'I', 'his', 'they', 'be', 'at', 'one', 'have', 'this', 'from', 'or', 'had', 'by', 'not', 'word', 'but', 'what', 'some', 'we', 'can', 'out', 'other', 'were', 'all', 'there', 'when', 'up', 'use', 'your', 'how', 'said', 'an', 'each', 'she', 'which', 'do', 'their', 'time', 'if', 'will', 'way', 'about', 'many', 'then', 'them', 'write', 'would', 'like', 'so', 'these', 'her', 'long', 'make', 'thing', 'see', 'him', 'two', 'has', 'look', 'more', 'day', 'could', 'go', 'come', 'did', 'number', 'sound', 'no', 'most', 'people', 'my', 'over', 'know', 'water', 'than', 'call', 'first', 'who', 'may', 'down', 'side', 'been', 'now', 'find', 'any', 'new', 'work', 'part', 'take', 'get', 'place', 'made', 'live', 'where', 'after', 'back', 'little', 'only', 'round', 'man', 'year', 'came', 'show', 'every', 'good', 'me', 'give', 'our', 'under', 'name', 'very', 'through', 'just', 'form', 'sentence', 'g

In [10]:
#remove common words
def remove_common_words(lst, common_words):
  return [l for l in lst if l not in common_words]
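
Two details may be worth tightening here. Membership tests against a list scan every element, and the common-word list contains the capitalised entry `'I'`, while the tokens are lowercased before filtering. A minimal sketch of one option is to build a lowercased `set` once and pass it in place of the list. The short `common_words` list and the `common_set` name below are illustrative assumptions, not the real data.

```python
def remove_common_words(lst, common_words):
    # Same filter as the notebook cell above, repeated so this block runs alone
    return [l for l in lst if l not in common_words]

# Assumption: a tiny stand-in for the real common_words list
common_words = ['the', 'is', 'I', 'of']

# Build the set once; lowercase it so 'I' matches the lowercased token 'i'
common_set = {w.lower() for w in common_words}

tokens = ['i', 'think', 'google', 'is', 'the', 'leader']
print(remove_common_words(tokens, common_set))
# → ['think', 'google', 'leader']
```

The function itself does not need to change, since `in` works on both lists and sets.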

Finally, I am ready to extract the keywords:

In [48]:
def extract_key_words(data, bad_words, common_words):
  """Tokenize every article, clean the tokens, and pool them into one list."""
  out_list = []
  for text in data['content']:
    # Lowercase and split on single spaces
    lst = text.lower().split(' ')
    out = clean_bad_words(lst, bad_words)
    out = strip_symbols(out)
    out = remove_common_words(out, common_words)
    out_list.extend(out)
  return out_list

In [49]:
out_list = extract_key_words(data, bad_words, common_words)
print(out_list)



In [50]:
counter = Counter(out_list).most_common(50)
print(counter)

[('', 10340), ('today', 2774), ('announced', 2042), ('games', 1622), ('guest', 1224), ('google', 1115), ('mobile', 1053), ('technology', 889), ('release', 814), ('app', 760), ('companies', 754), ('data', 736), ('service', 735), ('reality', 730), ('platform', 724), ('tech', 695), ('ai', 685), ('microsoft', 673), ('startup', 672), ('vr', 657), ('video', 653), ('pc', 646), ('years', 632), ('billion', 618), ('u.s', 613), ('virtual', 610), ('2017', 602), ('launched', 601), ('online', 601), ('software', 592), ('developers', 574), ('reuters', 569), ('users', 567), ('facebook', 561), ('series', 543), ('funding', 542), ('according', 541), ('percent', 538), ('intelligence', 535), ('gaming', 530), ('available', 510), ('launch', 503), ('business', 491), ('10', 489), ('latest', 477), ('artificial', 476), ('digital', 473), ('developer', 473), ('amazon', 472), ('xbox', 458)]
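
The most frequent entry above is the empty string. Empty tokens arise when `strip` removes every character of a token, and `split(' ')` also produces them for consecutive spaces. A minimal sketch of filtering them out before counting, using a made-up token list (an assumption, not the real `out_list`):

```python
from collections import Counter

# Assumption: toy tokens standing in for the real out_list
tokens = ['', 'google', '', 'games', 'google', '']

# Empty strings are falsy, so `if t` drops them before counting
counts = Counter(t for t in tokens if t)
print(counts.most_common(2))
# → [('google', 2), ('games', 1)]
```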


# Final Answers:

For the final answer, we have to go through the counts by hand and use human judgement to select the top keywords. The constraint is to remove anything that can't be a named entity (such as 'today', 'announced', and 'guest').

**Answers:**

1. games
2. Google
3. mobile
4. technology
5. release
6. app
7. data
8. service
9. reality
10. platform


# Extensions:

If I had several weeks to do the project, this is how I would extend it:

### Data Preparation:
  - Improve the data cleaning and munging functions to remove more of the common words that can't possibly be named entities or entity-like terms
  - Explore word embedding models to clarify cases when named entities might be overcounted because of their similarities to common words (e.g. cloud in cloud computing)
  - Check terms for aliases (e.g. AWS = Amazon cloud computing)
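
The alias check in the last bullet could start as a simple lookup table applied before counting. The sketch below is only an illustration: the `ALIASES` mapping and the `normalize_aliases` helper are hypothetical, and a real mapping would have to be curated against the articles.

```python
from collections import Counter

# Hypothetical alias table: maps short forms to one canonical label
ALIASES = {
    'aws': 'amazon web services',
    'vr': 'virtual reality',
    'ai': 'artificial intelligence',
}

def normalize_aliases(tokens, aliases=ALIASES):
    """Replace each known alias with its canonical label; keep other tokens."""
    return [aliases.get(token, token) for token in tokens]

# Toy tokens (assumption), not taken from the dataset
tokens = ['aws', 'vr', 'google', 'vr']
counts = Counter(normalize_aliases(tokens))
print(counts.most_common(1))
# → [('virtual reality', 2)]
```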

### Data Comprehension:
  - Connections between themes (as measured by things like co-occurrences and correlations between words being present in articles)
  - Temporal effects on data (when terms trend and decline)
  - Intra- and inter-article theme extraction
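
As a rough illustration of the co-occurrence idea in the first bullet, the sketch below counts how many articles each pair of keywords appears in together. It assumes each article has already been reduced to a list of cleaned tokens; `cooccurrence_counts` and the toy articles are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(articles):
    """Count, for each unordered keyword pair, the articles containing both."""
    pairs = Counter()
    for tokens in articles:
        # De-duplicate and sort so each pair is counted once per article
        # and always appears in the same order
        unique = sorted(set(tokens))
        pairs.update(combinations(unique, 2))
    return pairs

# Toy data (assumption): three already-cleaned articles
articles = [['google', 'ai', 'cloud'], ['google', 'ai'], ['vr', 'games']]
pairs = cooccurrence_counts(articles)
print(pairs.most_common(1))
# → [(('ai', 'google'), 2)]
```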

### Pipeline:
  - Build a pipeline to automate similar extractions across classes of articles
  - Deploy the pipeline to a cloud server (GCP, AWS) with a REST API and/or interface