# Features Information

**What is Given From Forbes Scraping:** 

1. link: URL of the article (non-predictive)
2. title: title of the article
3. text: article content
4. views: number of views in text
5. topic: given subtopics from Forbes 
6. time: time article was published

**Features Added**:

1. Innovation: dummy variable for articles in the innovation topic
2. Leadership: dummy variable for articles in leadership topic
3. Lifestyle: dummy variable for article in lifestyle topic
4. Money: dummy variable for article in money topic
5. month: month article was published 
6. Month dummies (12 total) - Jan, Feb, Mar, Apr, May, Jun, July, Aug, Sep, Oct, Nov, Dec
7. n_tokens_title: Number of words in the title
8. n_tokens_content: Number of words in the article
9. n_unique_tokens: Percent of unique words in the article
10. average_token_length: Average length of the words in the content
11. n_non_stop_words: Percent of non-stop words in the article
12. n_non_stop_unique_tokens: Percent of unique non-stop words in the article
13. day_of_week: day of the week the article was published 
14. Day of week dummies (7 total) - monday, tuesday, wednesday, thursday, friday, saturday, sunday 
15. Weekend_or_weekday: was article published on weekend or weekday 
16. weekday: dummy variable if article was published on weekday 
17. weekend: dummy variable if article was published on weekend
18. global_sentiment_polarity: article sentiment polarity
19. global_subjectivity: subjectivity of article conten 
20. abs_title_sentiment_polarity: Absolute polarity level
21. title_subjectivity: Title subjectivity
22. abs_title_subjectivity: Absolute difference of title subjectivity level - 0.5
23. title_sentiment_polarity: Title polarity
24. global_rate_positive_words: Rate of positive words in the article
25. global_rate_negative_words: Rate of negative words in the article
26. rate_positive_words: Rate of positive words among non-neutral tokens
27. rate_negative_words: Rate of negative words among non-neutral tokens
28. avg_positive_polarity: Avg. polarity of positive words in an article
29. min_positive_polarity: Min. polarity of positive words in an article
30. max_positive_polarity: Max. polarity of positive words in an article
31. avg_negative_polarity: Avg. polarity of negative words in an article
32. min_negative_polarity: Min. polarity of negative words in an article
33. max_negative_polarity: Max. polarity of negative words in an article
34. LDA_00: Closeness to LDA topic 0
35. LDA_01: Closeness to LDA topic 1
36. LDA_02: Closeness to LDA topic 2
37. LDA_03: Closeness to LDA topic 3
38. LDA_04: Closeness to LDA topic 4
39. kw_min_min: Worst keyword (min. shares)
40. kw_max_min: Worst keyword (max. shares)
41. kw_avg_min: Worst keyword (avg. shares)
42. kw_min_max: Best keyword (min. shares)
43. kw_max_max: Best keyword (max. shares)
44. kw_avg_max: Best keyword (avg. shares)
45. kw_min_avg: Avg. keyword (min. shares)
46. kw_max_avg: Avg. keyword (max. shares)
47. kw_avg_avg: Avg. keyword (avg. shares)
48. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
49. num_keywords: Number of keywords in the article

## Relevant Links
1. Original Article Link: https://repositorium.sdum.uminho.pt/bitstream/1822/39169/1/main.pdf
2. Useful TextBlob Links:
    1. https://stackabuse.com/sentiment-analysis-in-python-with-textblob/
    2. https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524
3. Datettime reference sheets: 
    1. https://strftime.org/
    2. https://stackoverflow.com/questions/8170982/strip-string-after-third-occurrence-of-character-python
4. LDA Analysis 
    1. https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
    2. https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28
    3. https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2

## 1. Import Packages

In [1]:
import pandas as pd
import numpy as np
import datetime
from textblob import TextBlob #for polarity and sentiment analysis
import nltk
from nltk.tokenize import word_tokenize  
from nltk.corpus import stopwords
import swifter
from gensim.summarization import summarize, keywords
from pprint import pprint

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
nltk.download('wordnet')

import string

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/haleyfarber/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 2. Read Data

In [2]:
df = pd.read_csv("data_7k.csv") 

In [3]:
df #initial look at dataframe

Unnamed: 0.1,Unnamed: 0,link,title,text,view,topic,time
0,0,https://www.forbes.com/sites/billybambrough/20...,"$50 Billion Crash—What Next For Bitcoin, Ether...","Bitcoin, ethereum, Ripple\s XRP, bitcoin cash,...",47503,Crypto & Blockchain,"Nov 27, 2020, 07:12am"
1,1,https://www.forbes.com/sites/abigailabesamis/2...,12 Bakers Share What They’re Whipping Up Durin...,Sourdough loaves (plus creative uses for disca...,2826,Dining,"May 22, 2020, 11:31am"
2,2,https://www.forbes.com/sites/abigailabesamis/2...,15 Chefs Share What They’re Cooking During The...,In addition to offering delivery and curbside ...,4064,Dining,"Apr 16, 2020, 04:56pm"
3,3,https://www.forbes.com/sites/kyleedward/2020/1...,2021 Genesis GV80 First Drive: The Flagship SU...,Hyundai\s luxury brand Genesis has had a stron...,4264,Cars & Bikes,"Nov 26, 2020, 07:03pm"
4,4,https://www.forbes.com/sites/jasonfogelson/202...,2021 Genesis GV80 Test Drive And Review: Serio...,Meet the 2021 Genesis GV80.The US automotive m...,738,Cars & Bikes,"Nov 29, 2020, 03:05pm"
...,...,...,...,...,...,...,...
7479,1164,https://www.forbes.com/sites/zakdoffman/2020/0...,New Microsoft Security ‘Nightmare’: Users Warn...,SOPA IMAGES/LIGHTROCKET VIA GETTY IMAGESMicros...,60074,Cybersecurity,"Mar 4, 2020, 06:15am"
7480,1165,https://www.forbes.com/sites/zakdoffman/2020/0...,Hackers Attack Microsoft Windows Users: Danger...,GETTYFollowing reports that China has been cau...,11974,Cybersecurity,"Mar 16, 2020, 11:00am"
7481,1166,https://www.forbes.com/sites/zakdoffman/2020/0...,Huawei’s Newest Update—The Ultimate Phone For ...,AFP VIA GETTY IMAGESHuawei has endured a diffi...,356160,Cybersecurity,"Jun 5, 2020, 10:57am"
7482,1167,https://www.forbes.com/sites/zakdoffman/2020/0...,Android Messages And Apple iMessage Beaten By ...,GETTYWhatsApp is on something of a roll at the...,174791,Cybersecurity,"Sep 21, 2020, 07:04pm"


In [4]:
df.dropna(how='any', inplace=True)

In [5]:
df.isnull().sum()

Unnamed: 0    0
link          0
title         0
text          0
view          0
topic         0
time          0
dtype: int64

## 3. Define Functions

In [6]:
#Related to Time Features
def trunc_at(s, d,n):
    "Returns string truncated at the n'th occurrence of the delimiter, d."
    return d.join(s.split(d, n)[:n])
def convert_date(date):
    '''First uses trunc_at to get rid of time of day the article was posted. Then it turns the remaining date into
       integers from 0-6 where 0 represents Sunday.
    '''
    strip_date = trunc_at(date,",",2)
    dummy_date = datetime.datetime.strptime(strip_date, '%b %d, %Y').strftime('%w')
    return int(dummy_date)

In [7]:
#Related to Weekend Dummy Variable Analysis
def weekend_or_not(day):
    '''Returns whether or not an article was published on a weekend or weekday. Day is an argument passed in
       with integers 0-6 where 0 represents Sunday.
    '''
    #0 represents sunday and 6 represents saturday
    if day not in (0,6):
        return "weekday"
    else:
        return "weekend"

In [8]:
#Related to Month Dummy Variable Analysis
def new_convert_date(date):
    '''First uses trunc_at to get rid of time of day the article was posted. Then it turns the remaining date into
       integers from 0-6 where 0 represents Sunday.
    '''
    strip_date = trunc_at(date,",",2)
    dummy_date = datetime.datetime.strptime(strip_date, '%b %d, %Y').strftime('%b')
    return dummy_date

In [9]:
#Related to Polarity Analysis
def find_polarity(words):
    '''Returns the polarity of every word in an article where the argument words are the words in an article.
    '''
    polarity_words = [TextBlob(word).sentiment.polarity for word in words]
    return polarity_words
def find_pos_words(polarity_words):
    '''Returns the polarities of all the positive words in an article with the polarity of all words in an 
       article as function argument. Postive word is defined as a word with a polarity greater than 0.
    '''
    pos_words = [word for word in polarity_words if word > 0] 
    return pos_words
def find_neg_words(polarity_words):
    '''Returns the polarities of all the negative words in an article with the polarity of all words in an 
       article as function argument. Negative word is defined as a word with a polarity less than 0.
    '''
    neg_words = [word for word in polarity_words if word < 0]
    return neg_words
def num_neu_words(polarity_words):
    '''Returns the number of neutral words in an article with the polarity of all word in an article as function
       argument. Neutral word id defined as a word with a polarity equal to 0.
    '''
    neu_words = [word for word in polarity_words if word == 0]
    neu_num = len(neu_words)
    return neu_num

In [10]:
#Related to Uninque Words Analysis
stop_words = stopwords.words("english")
def tokenize(text):
    '''Returns the tokenized words in an article.
    '''
    tokenize_words = word_tokenize(text)
    tokens = [word for word in tokenize_words if word not in stop_words]
    return tokens

In [11]:
# Related to Unique Words Analysis
def preprocess_words(text):
    '''Returns words lowercased and without punctuation.
    '''
    words = text.split()
    table = str.maketrans('', '', string.punctuation)
    stripped_words = [w.translate(table) for w in words]
    words = [word.lower() for word in stripped_words]
    return words
def non_stop_words(text):
    '''Returns non_stop_words in text after applying the preprocess_words function to lowercase the words and 
       remove punctuation.
    '''
    words = preprocess_words(text)
    non_stop_words = [word for word in words if word not in stop_words]
    return non_stop_words

In [12]:
# Related to LDA Analysis 
stemmer = SnowballStemmer("english")
def lemmatize_stemming(text):
    '''Lemmatize and stem words. Lemmatizing changes third person words to first person and verbs in past and 
       future tenses to present tenses. Stemming reduces words to their root form.
    '''
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def preprocess(text):
    '''Preprocesses text by lowercasing and tokenizing words through gensim.utils.simple_preprocess and 
       removing stop words and only keeping words greater than three characters.
       Then applies the lemmatize_stemming function to lemmatize and stem words.
    '''
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
def find_probs(prob_list,num):
    '''Takes in a 2d list with the probabilities for each topic per document with index for topic. Returns 
       just a list of probabilities in order of topics per document.
       num parameter = number of topics for lda model.
    '''
    probs =[prob_list[i][1] for i in range(num)]
    return probs

In [13]:
#Related to TimeDelta Analysis
def timedelta(time,run_date):
    strip_date = trunc_at(time,",",2)
    date = datetime.datetime.strptime(strip_date, '%b %d, %Y')
    timedelta_days = (run_date - date).days
    return timedelta_days

## 4. Setting Up Global Variables For Later Use

In [14]:
title = df["title"] #title of every article
text = df["text"] #content of every article
time = df["time"] #time of publication for every article
views = df["view"] #views for every article

## 5. Feature Variables

### 5.1 Topic Dummy Variables

In [15]:
#Get Topic Dummy Variables
innovation = ["Healthcare, Healthcare", "Cloud", "AI", "Amazon Web Services BrandVoice | Paid Program",
                  "AWS Infrastructure Solutions BrandVoice | Paid Program", "Big Data,Cloud","Cloud 100",
                  "Consumer Tech","Cybersecurity","Enterprise Tech","Games","Green Tech","Healthcare",
                  "Innovation Rules","Japan BrandVoice | Paid Program","Jumio BrandVoice | Paid Program",
                  "SAP BrandVoice | Paid Program","Science","ServiceNow BrandVoice | Paid Program","Social Media",
                  "Tableau BrandVoice | Paid Program","T-Mobile For Business BrandVoice | Paid Program",
                  "Venture Capital","Enterprise & Cloud","Big Data","Tech"]
leadership = ["ForbesWomen, ForbesWomen","Billionaires", "Asia", "World's Billionaires", "Forbes 400","America's Richest", "Self-Made Women", "China's Richest",
                  "India's Richest","Indonesia's Richest","Korea's Richest","Thailand's Richest","Japan's Richest",
                  "Australia'sc Richest","Taiwan's Richest","Singapore's Richest","Philippines' Richest",
                  "Hong Kong's Richest", "Malaysia's Richest", "Money & Politics", "2020 Money", "Careers","CEO Network","CFO Network","CIO Network","CMO Network",
                  "Deloitte BrandVoice | Paid Program","Diversity & Inclusion","Education","Forbes The Culture",
                  "ForbesWomen","Google Cloud BrandVoice | Paid Program","Leadership Strategy","Under 30",
                  "Working Remote"]
money = ["Hedge Funds & Private Equity", "Fintech", "Banking & Insurance","Crypto & Blockchain","ETFs & Mutual Funds","Fintech""Hedge Funds & Private Equity",
         "Investing","Markets","New York Life Investments BrandVoice | Paid Program","Personal Finance",
         "Premium Investing Newsletters","Retirement","Taxes","Tax-Smart Investing","Top Advisor | SHOOK",
         "Wealth Management","Election 2020", "Newsletters"]
business = ["Business, Business","Small Business", "Aerospace & Defense","Energy","Hollywood & Entertainment",
            "Honeywell BrandVoice | Paid Program","Manufacturing","Media",
            "Mitsubishi Heavy Industries BrandVoice | Paid Program","Policy","Real Estate","Retail",
            "Salesforce BrandVoice | Paid Program","SportsMoney","Transportation", "Business As (Un)usual","Entrepreneurs","Franchises","Small Business Strategy","Square BrandVoice | Paid Program",
            "Consumer"]
lifestyle = ["Food & Drink", "Watches & Jewelry", "Arts","Boats & Planes","Cars & Bikes","Dining","ForbesLife","Forbes Travel Guide","Spirits",
             "Style & Beauty","Travel","Vices","Watches","Dining & Drinking"]

new_data = df['topic'].replace(innovation, "Innovation")
new_data = new_data.replace(leadership, "Leadership")
new_data = new_data.replace(money, "Money")
new_data = new_data.replace(business, "Business")
new_data = new_data.replace(lifestyle, "Lifestyle")
new_df = pd.get_dummies(new_data)
df = pd.concat([df,new_df],axis = 1) 
df

Unnamed: 0.1,Unnamed: 0,link,title,text,view,topic,time,Business,Innovation,Leadership,Lifestyle,Money
0,0,https://www.forbes.com/sites/billybambrough/20...,"$50 Billion Crash—What Next For Bitcoin, Ether...","Bitcoin, ethereum, Ripple\s XRP, bitcoin cash,...",47503,Crypto & Blockchain,"Nov 27, 2020, 07:12am",0,0,0,0,1
1,1,https://www.forbes.com/sites/abigailabesamis/2...,12 Bakers Share What They’re Whipping Up Durin...,Sourdough loaves (plus creative uses for disca...,2826,Dining,"May 22, 2020, 11:31am",0,0,0,1,0
2,2,https://www.forbes.com/sites/abigailabesamis/2...,15 Chefs Share What They’re Cooking During The...,In addition to offering delivery and curbside ...,4064,Dining,"Apr 16, 2020, 04:56pm",0,0,0,1,0
3,3,https://www.forbes.com/sites/kyleedward/2020/1...,2021 Genesis GV80 First Drive: The Flagship SU...,Hyundai\s luxury brand Genesis has had a stron...,4264,Cars & Bikes,"Nov 26, 2020, 07:03pm",0,0,0,1,0
4,4,https://www.forbes.com/sites/jasonfogelson/202...,2021 Genesis GV80 Test Drive And Review: Serio...,Meet the 2021 Genesis GV80.The US automotive m...,738,Cars & Bikes,"Nov 29, 2020, 03:05pm",0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
7479,1164,https://www.forbes.com/sites/zakdoffman/2020/0...,New Microsoft Security ‘Nightmare’: Users Warn...,SOPA IMAGES/LIGHTROCKET VIA GETTY IMAGESMicros...,60074,Cybersecurity,"Mar 4, 2020, 06:15am",0,1,0,0,0
7480,1165,https://www.forbes.com/sites/zakdoffman/2020/0...,Hackers Attack Microsoft Windows Users: Danger...,GETTYFollowing reports that China has been cau...,11974,Cybersecurity,"Mar 16, 2020, 11:00am",0,1,0,0,0
7481,1166,https://www.forbes.com/sites/zakdoffman/2020/0...,Huawei’s Newest Update—The Ultimate Phone For ...,AFP VIA GETTY IMAGESHuawei has endured a diffi...,356160,Cybersecurity,"Jun 5, 2020, 10:57am",0,1,0,0,0
7482,1167,https://www.forbes.com/sites/zakdoffman/2020/0...,Android Messages And Apple iMessage Beaten By ...,GETTYWhatsApp is on something of a roll at the...,174791,Cybersecurity,"Sep 21, 2020, 07:04pm",0,1,0,0,0


### 5.2 Creating Month Dummies

In [16]:
#Creating month dummies
df["month"] = time.apply(lambda x: new_convert_date(x))
df = df.join(pd.get_dummies(df["month"])) 

In [17]:
sum(df["month"] == "Nov") #checking number of articles in November

1481

### 5.3 Number of Words and Rate of Non-Stop and Unique Words

In [18]:
#Word Count and Rate Analysis
df["n_tokens_title:"] = title.apply(lambda x:len(x.split())) #number of words in title
df["n_tokens_content"] = text.apply(lambda x: len(x.split())) #number of words in article (minus title)
#fraction of unique words in an article
df["n_unique_tokens"] = text.apply(lambda x: len(set(preprocess_words(x)))/len(preprocess_words(x)))
#average length of words in an article
df["average_token_length"] = text.apply(lambda x: sum(len(word) for word in preprocess_words(x)) / len(preprocess_words(x)))
#Rate of non-stop words in the content
df["n_non_stop_words"] = text.apply(lambda x: len(non_stop_words(x))/len(preprocess_words(x)))
#Rate of unique non-stop words in the content
df["n_non_stop_unique_tokens"] = text.apply(lambda x: len(set(non_stop_words(x)))/len(preprocess_words(x)))

### 5.4 Create Day of the Week Dummy Variables

In [19]:
#creates day of the week integers
df["day_of_week"] = time.apply(lambda x: convert_date(x)) #convert all dates into integers 0-6

In [20]:
#creates dummy columns named by the days of the week. day_mapping maps integers to name of days of the week
day_mapping = {0: 'sunday', 1: 'monday', 2: 'tuesday', 3: 'wednesday', 4: 'thursday', 5: 'friday', 6: 'saturday'}
df = df.join(pd.get_dummies(df["day_of_week"].map(day_mapping)))

### 5.5 Create Weekday/Weekend Dummy Variables

In [21]:
df["weekend_or_weekday"] = df["day_of_week"].apply(lambda x: weekend_or_not(x))#article published on weekend/weekday
dummies = pd.get_dummies(df["weekend_or_weekday"]) #makes dummies for weekend/weekday
df = pd.concat([df,dummies],axis = 1) #joins dummy columns to original data frame

### 5.6 Create Sentiment and Polarity Features

In [22]:
#Sentiment and Polarity Analysis
#Article polarity and subjectivity
df["global_sentiment_polarity"] = text.apply(lambda x: TextBlob(x).sentiment.polarity)
df["global_subjectivity"] = text.apply(lambda x: TextBlob(x).sentiment.subjectivity)

#Title polarity and subjectivity
df["title_sentiment_polarity"] = title.apply(lambda x: TextBlob(x).sentiment.polarity)
df["abs_title_sentiment_polarity"] = df["title_sentiment_polarity"].apply(lambda x: abs(x))
df["title_subjectivity"] = title.apply(lambda x: TextBlob(x).sentiment.subjectivity)
df["abs_title_subjectivity"] = df["title_subjectivity"].apply(lambda x: abs(x - 0.5))

In [23]:
#More Polarity Analyis, Max,Min,and Avg
words = text.apply(lambda x: x.split())
polarity_words = words.apply(lambda x: find_polarity(x))

In [24]:
pos_words = polarity_words.apply(lambda x: find_pos_words(x))
neg_words = polarity_words.apply(lambda x: find_neg_words(x))
num_neu_words = polarity_words.apply(lambda x: num_neu_words(x))
num_total_words = words.apply(lambda x: len(x))
num_non_neutral_words = num_total_words - num_neu_words
num_pos_words = pos_words.apply(lambda x: len(x))
num_neg_words = neg_words.apply(lambda x: len(x))

In [25]:
df["global_rate_positive_words"] = num_pos_words/num_total_words
df["global_rate_negative_words"] = num_neg_words/num_total_words
df["rate_positive_words"] = num_pos_words/num_non_neutral_words
df["rate_negative_words"] = num_neg_words/num_non_neutral_words
df["avg_positive_polarity"] = pos_words.apply(lambda x: np.mean(x) if x else 0)
df["min_positive_polarity"] = pos_words.apply(lambda x: min(x) if x else 0)
df["max_positive_polarity"] = pos_words.apply(lambda x: max(x) if x else 0)
df["avg_negative_polarity"] = neg_words.apply(lambda x: np.mean(x) if x else 0)
df["min_negative_polarity"] = neg_words.apply(lambda x: min(x) if x else 0 )
df["max_negative_polarity"] = neg_words.apply(lambda x: max(x) if x else 0)

### 5.7 Create LDA Features

In [26]:
#LDA Features
processed_docs = df["text"].map(preprocess) #preprocess all articles in dataframe
dictionary = gensim.corpora.Dictionary(processed_docs) #make dictionary for words in all articles
#filter out tokens in less than 15 articles or are in more than 50% of articles  and then keep only top 100,000 tokens
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000) 
#Createa a dictionary reporting how many words and how many times those words appear for each article
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs] 
#Train lda model using gensim.models.LdaMulticore using 5 topics
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=2, workers=2)
#get list with the probabilities for each topic per document with index for topic
get_document_topics = [lda_model.get_document_topics(item, minimum_probability = 0) for item in bow_corpus]
#get list of just probabilities for each topic
lda_probs = list(map(lambda x: find_probs(x,5), get_document_topics))
#create data frame with LDA feature names
probs_df = pd.DataFrame(lda_probs, columns = ["LDA_00","LDA_01","LDA_02","LDA_03","LDA_04"])
# Reset index, make sure to print out df to double check the order
df = df.reset_index().sort_index()
df = df.drop(['index'], axis=1)
df = pd.concat([df, probs_df], axis=1) #join LDA features with the main data frame
df.dropna(how='any', inplace=True)
title = df["title"] #title of every article
text = df["text"] #content of every article
time = df["time"] #time of publication for every article
views = df["view"] #views for every article

### 5.8 Timedelta Feature 

In [27]:
#Timedelta Feature 
time_run = 'Dec 1, 2020'
run_date = datetime.datetime.strptime(time_run,'%b %d, %Y')
df["timedelta"] = time.apply(lambda x: timedelta(x,run_date))

### 5.9 Keyword Analysis

In [28]:
df['num_keywords'] = [len(keywords(article)) for article in text] #number of keywords

We did not run these keyword features in this notebook. They take a very long time to run so Alice and Derrick split the data and each ran the keyword analysis on half the features. They then combined their keyword analysis and joined the keyword dataframe to the rest of this features data frame.

In [None]:
#Keywords Analysis
Here we use gensim package to extract keyword for every article from its texts
df['view'] = df['view'].apply(lambda x: int(str(x).replace(",", "").replace(" ", "")))
lst = []
for t in df["text"]:
    lst.append(keywords(t).split('\n'))

In [None]:
# Change keyword view indexing to
def find_views_count(word):
    '''This help function counts the individual views, the total views, total counts, total views / total counts 
       for every word.
    '''
    counts = []
    for i in range(len(lst)):
        text = lst[i]
        if word in text:
            counts.append(df.iloc[i]["view"])
    return (counts, sum(counts), len(counts), sum(counts) / len(counts))
def find_max_min_avg(words):
    '''This function calculates the 9 keyword features.
    '''
    # Temp stores the values of total views / total counts for every word in a list of keywords from a given article
    temp = [dic[word][3] for word in words]
    
    # Getting the best, worst and average word
    best = np.argmax(temp)
    worst = np.argmin(temp)
    avg = np.argsort(temp)[len(temp)//2]
    
    # Getting the individual views for the best, worst and average word
    individual_views_worst = dic[words[worst]][0]
    individual_views_best = dic[words[best]][0]
    individual_views_avg = dic[words[avg]][0]
    
    # Calculating the features
    min_min = min(individual_views_worst)
    min_max = max(individual_views_worst)
    min_avg = np.mean(individual_views_worst)
    max_min = min(individual_views_best)
    max_max = max(individual_views_best)
    max_avg = np.mean(individual_views_best)
    avg_min = min(individual_views_avg)
    avg_max = max(individual_views_avg)
    avg_avg = np.mean(individual_views_avg)
    return [min_min, min_max, min_avg, max_min, max_max, max_avg, avg_min, avg_max, avg_avg]

In [None]:
# Creating the big dictionary 
dic = {}
for words in lst:
    for word in words:
        tup = find_views_count(word)
        dic[word] = tup

In [None]:
# Here we invoked the second help function to put the 9 features into a dataframe.
kw_df = pd.DataFrame({'kw_min_min' : [],'kw_min_max': [],'kw_min_avg': [],'kw_max_min': [],'kw_max_max': [],'kw_max_avg': [], 'kw_avg_min': [], 'kw_avg_max': [],'kw_avg_avg': []})
for words in lst:
    row = find_max_min_avg(words)
    kw_df.loc[len(kw_df)] = row

In [None]:
#Combine keywords with entire dataframe
df = pd.concat([df,kw_df],axis = 1)

## 6. Look at Created Dataframe and Convert to CSV

In [29]:
df #last view at dataframe before making into a csv

Unnamed: 0.1,Unnamed: 0,link,title,text,view,topic,time,Business,Innovation,Leadership,...,avg_negative_polarity,min_negative_polarity,max_negative_polarity,LDA_00,LDA_01,LDA_02,LDA_03,LDA_04,timedelta,num_keywords
0,0,https://www.forbes.com/sites/billybambrough/20...,"$50 Billion Crash—What Next For Bitcoin, Ether...","Bitcoin, ethereum, Ripple\s XRP, bitcoin cash,...",47503,Crypto & Blockchain,"Nov 27, 2020, 07:12am",0,0,0,...,-0.252778,-0.900000,-0.100,0.000631,0.000628,0.000629,0.779793,0.218320,4,316
1,1,https://www.forbes.com/sites/abigailabesamis/2...,12 Bakers Share What They’re Whipping Up Durin...,Sourdough loaves (plus creative uses for disca...,2826,Dining,"May 22, 2020, 11:31am",0,0,0,...,-0.220726,-0.600000,-0.050,0.995311,0.004104,0.000195,0.000194,0.000195,193,1069
2,2,https://www.forbes.com/sites/abigailabesamis/2...,15 Chefs Share What They’re Cooking During The...,In addition to offering delivery and curbside ...,4064,Dining,"Apr 16, 2020, 04:56pm",0,0,0,...,-0.220483,-0.600000,-0.050,0.920161,0.079240,0.000200,0.000199,0.000200,229,1137
3,3,https://www.forbes.com/sites/kyleedward/2020/1...,2021 Genesis GV80 First Drive: The Flagship SU...,Hyundai\s luxury brand Genesis has had a stron...,4264,Cars & Bikes,"Nov 26, 2020, 07:03pm",0,0,0,...,-0.293182,-0.666667,-0.100,0.807626,0.190862,0.000505,0.000503,0.000503,5,455
4,4,https://www.forbes.com/sites/jasonfogelson/202...,2021 Genesis GV80 Test Drive And Review: Serio...,Meet the 2021 Genesis GV80.The US automotive m...,738,Cars & Bikes,"Nov 29, 2020, 03:05pm",0,0,0,...,-0.207833,-0.750000,-0.050,0.787170,0.017800,0.194517,0.000257,0.000256,2,815
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7465,1164,https://www.forbes.com/sites/zakdoffman/2020/0...,New Microsoft Security ‘Nightmare’: Users Warn...,SOPA IMAGES/LIGHTROCKET VIA GETTY IMAGESMicros...,60074,Cybersecurity,"Mar 4, 2020, 06:15am",0,1,0,...,-0.542500,-1.000000,-0.125,0.557103,0.000489,0.199217,0.242703,0.000489,272,489
7466,1165,https://www.forbes.com/sites/zakdoffman/2020/0...,Hackers Attack Microsoft Windows Users: Danger...,GETTYFollowing reports that China has been cau...,11974,Cybersecurity,"Mar 16, 2020, 11:00am",0,1,0,...,-0.193750,-0.500000,-0.100,0.015554,0.000646,0.545519,0.422304,0.015976,260,398
7467,1166,https://www.forbes.com/sites/zakdoffman/2020/0...,Huawei’s Newest Update—The Ultimate Phone For ...,AFP VIA GETTY IMAGESHuawei has endured a diffi...,356160,Cybersecurity,"Jun 5, 2020, 10:57am",0,1,0,...,-0.251389,-0.600000,-0.125,0.459416,0.000716,0.052201,0.474040,0.013627,179,308
7468,1167,https://www.forbes.com/sites/zakdoffman/2020/0...,Android Messages And Apple iMessage Beaten By ...,GETTYWhatsApp is on something of a roll at the...,174791,Cybersecurity,"Sep 21, 2020, 07:04pm",0,1,0,...,-0.266667,-0.500000,-0.125,0.000872,0.460759,0.391265,0.146233,0.000871,71,274


In [30]:
df.columns

Index(['Unnamed: 0', 'link', 'title', 'text', 'view', 'topic', 'time',
       'Business', 'Innovation', 'Leadership', 'Lifestyle', 'Money', 'month',
       'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov',
       'Oct', 'Sep', 'n_tokens_title:', 'n_tokens_content', 'n_unique_tokens',
       'average_token_length', 'n_non_stop_words', 'n_non_stop_unique_tokens',
       'day_of_week', 'friday', 'monday', 'saturday', 'sunday', 'thursday',
       'tuesday', 'wednesday', 'weekend_or_weekday', 'weekday', 'weekend',
       'global_sentiment_polarity', 'global_subjectivity',
       'title_sentiment_polarity', 'abs_title_sentiment_polarity',
       'title_subjectivity', 'abs_title_subjectivity',
       'global_rate_positive_words', 'global_rate_negative_words',
       'rate_positive_words', 'rate_negative_words', 'avg_positive_polarity',
       'min_positive_polarity', 'max_positive_polarity',
       'avg_negative_polarity', 'min_negative_polarity',
       'max_negative_pola

In [31]:
df.to_csv("Features.csv")