In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import langdetect
from langdetect import detect
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jordansamek/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [76]:
titles = pd.read_csv('../data/clean_titles.csv')
titles.head()

Unnamed: 0,clean_title
0,35 affordable things make feel oh-so-fancy
1,"would die ""game thrones?"""
2,harry sally? john mclaine holly? quiz determin...
3,25 times harry styles went way make fans hella...
4,"penn badgley weighed wild finale ""you"" season ..."


I left the titles slightly messy for the purpose of visualization, specifically for use with `pyLDAvis`. Now we can clean them properly for the models that will use the data.

A preprocess function was already created in the `headline_eda` notebook; this is a modification of that function. We don't need all of the methods used previously because stop words are already removed and the words are already tokenized, stemmed, and lowercased. What we're doing now is making sure everything is lowercased, replacing punctuation and any other non-alphanumeric characters with spaces, and collapsing the extra whitespace that leaves behind.

In [77]:
def preprocess(df):
    df['clean_title'] = df['clean_title'].str.lower()
    # regex=True is required on pandas >= 2.0, where str.replace treats the pattern literally by default
    df['clean_title'] = df['clean_title'].str.replace(r'[^a-zA-Z0-9]', ' ', regex=True) # replace non-alphanumeric characters with spaces
    df['clean_title'] = df['clean_title'].str.replace(r'\s+', ' ', regex=True) # replace multiple spaces with a single space
    df['clean_title'] = df['clean_title'].str.strip() # remove leading and trailing spaces
    return df
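
To see what each step does, here is the same sequence applied to a single string with the standard `re` module. This is only a sketch: `clean_one` is a hypothetical helper name for illustration, not something from the `headline_eda` notebook.

```python
import re

def clean_one(title):
    """Apply the same steps as preprocess() to one string (hypothetical helper)."""
    title = title.lower()
    title = re.sub(r'[^a-z0-9]', ' ', title)  # punctuation and symbols become spaces
    title = re.sub(r'\s+', ' ', title)         # collapse runs of whitespace
    return title.strip()                       # drop leading and trailing spaces

print(clean_one('would die "game thrones?"'))
# would die game thrones
```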

In [78]:
titles = preprocess(titles)
titles.head()

Unnamed: 0,clean_title
0,35 affordable things make feel oh so fancy
1,would die game thrones
2,harry sally john mclaine holly quiz determine ...
3,25 times harry styles went way make fans hella...
4,penn badgley weighed wild finale you season 3 ...


In [79]:
titles.to_csv('../data/preprocessed_titles.csv', index=False)

In [26]:
import markovify

In [27]:
text_model = markovify.NewlineText(titles['clean_title'], state_size=2)

In [30]:
for _ in range(4):
    print(text_model.make_sentence())

33 subtly sexist things women like them honestly messed
film passing seen cast hawkeye
again nick cannon plotting kids time makes zero sense today
tbh 30 red carpet moments prove none us allowed watch high school


According to the GitHub page for `markovify`, the library works best with large, well-punctuated texts. Maybe I could try plugging in the original data, from before it was cleaned, in this notebook?

In [31]:
headlines_df = pd.read_csv('../data/buzzfeed_headlines.csv')
headlines_df.head()

Unnamed: 0,content,description,title
0,"""In group projects, no boy ever asks for me to...",AAAAAHH!!!View Entire Post ›,33 Subtly Sexist Things Women Deal With That O...
1,"u/minipenguz\r\n""It goes from reading TO your ...",Literally about to cross-stitch all of these a...,People Are Sharing Their Random Bits Of Life A...
2,Get all the best moments in pop culture &amp; ...,"""My log does not judge.""View Entire Post ›","There's A ""Twin Peaks"" Resident In Us All, Whi..."
3,"""I havent brought it up at all since weve been...","""I hate that I’m so uncomfortable. I wish I wa...",What Do You Do When Your Identical Twin Starts...
4,"""Last time Queensland went through mandatory i...","""I'm quitting a job that earns me roughly $140...","In The Wake Of ""The Great Resignation"", Aussie..."


In [32]:
headlines_df = headlines_df[['title']].copy()  # copy so later column assignments don't act on a view

In [36]:
def detect_lang(txt):
    # langdetect raises on empty or non-text input; treat those rows as unknown
    try:
        return detect(txt)
    except Exception:
        return np.nan
    
headlines_df['language'] = headlines_df.title.apply(detect_lang)
headlines_df = headlines_df[headlines_df.language == "en"]
headlines_df = headlines_df.drop('language', axis=1)
headlines_df.head()

Unnamed: 0,title
0,33 Subtly Sexist Things Women Deal With That O...
1,People Are Sharing Their Random Bits Of Life A...
2,"There's A ""Twin Peaks"" Resident In Us All, Whi..."
3,What Do You Do When Your Identical Twin Starts...
4,"In The Wake Of ""The Great Resignation"", Aussie..."


In [48]:
def preprocess_2(df):
    # lowercase, then drop stop words; punctuation is kept on purpose for markovify
    df['title'] = df['title'].str.lower()
    df['title'] = df['title'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
    return df
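
One side effect of splitting on whitespace is that a token with punctuation attached, such as `up,`, does not match the stop word `up` and so survives the filter. The sketch below shows one possible way around that by comparing the punctuation-stripped form of each token. `drop_stop_words` and `demo_stop_words` are hypothetical names, and the tiny set stands in for NLTK's list.

```python
import string

# Tiny stand-in for NLTK's English stop-word list (assumption, for illustration only).
demo_stop_words = {"i", "up", "the", "of", "are"}

def drop_stop_words(title, stop_words):
    """Drop tokens whose punctuation-stripped form is a stop word."""
    kept = []
    for word in title.lower().split():
        core = word.strip(string.punctuation)  # "up," -> "up"
        if core not in stop_words:
            kept.append(word)
    return " ".join(kept)

print(drop_stop_words("Things Growing Up, Congratulations!", demo_stop_words))
# things growing congratulations!
```

The trade-off is that the comma attached to a dropped token disappears with it, which removes some of the punctuation `markovify` could otherwise use.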



In [49]:
clean_headlines = preprocess_2(headlines_df)
clean_headlines.sample(5)

Unnamed: 0,title
198,people practice polyamory — rules put place re...
82,"film tv crews, want hear working conditions set"
80,billie lourd got super honest generational tra...
188,"i'm sorry, literally 40 things growing up, con..."
111,"could make across survive glass bridge ""squid ..."


In [50]:
clean_headlines.sample(15)

Unnamed: 0,title
107,jamie lee curtis revealed lindsay lohan secret...
86,"""reba"" first aired 20 years ago, here's cast l..."
341,"31 food photos horrific, they'll hurt eyes eve..."
61,"woodworkers get credit deserve, 25 photos prove"
36,34 nice things home
105,"""penn badgley hot joe goldberg not"" 26 great t..."
335,"female ride-share driver, we'd like hear exper..."
329,wanna know dominant personality trait? eat ice...
270,90s rom-coms 100% make life better
376,32 products fitness-related aches pains


In [60]:
# clean_headlines is the same DataFrame as headlines_df (preprocess_2 modifies it in place)
text_model_2 = markovify.Text(clean_headlines['title'], state_size=2)

In [68]:
for _ in range(4):
    print(text_model_2.make_sentence())

lauren ridloff made history cfda fashion awards
put ranch 13 foods, i'm sorry, literally 40 things growing up, congratulations! officially old
rihanna perfectly replicated instagram post gunna halloween 2021 something nutty say
None


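Not much changed after only slightly pre-processing the data. We even get the same results for some headlines, and one attempt came back as `None`, which is what `make_sentence` returns when it cannot build a sentence that passes its checks within the allowed number of tries. I know the dataset overall isn't large enough to provide sufficient training for good headlines.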
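
Since a failed attempt shows up as `None` in the printed output, one option is to retry a few times and only keep a real sentence. A minimal sketch, where `first_valid` is a hypothetical helper that works with any zero-argument generator function such as `text_model_2.make_sentence`:

```python
def first_valid(make_sentence, attempts=20):
    """Call `make_sentence` until it returns something other than None."""
    for _ in range(attempts):
        sentence = make_sentence()
        if sentence is not None:
            return sentence
    return None  # every attempt failed

# Stand-in generator for illustration: fails twice, then succeeds.
_results = iter([None, None, "harry styles went viral"])
print(first_valid(lambda: next(_results)))
# harry styles went viral
```

In this notebook the call would be `first_valid(text_model_2.make_sentence)`. `markovify` also accepts a `tries` keyword on `make_sentence` (the default is 10), which raises the number of internal attempts before it gives up.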

Let's reload our `preprocessed_titles` data and try a simple character-level generation model.

In [3]:
text = pd.read_csv('../data/preprocessed_titles.csv')
text.sample(5)

Unnamed: 0,clean_title
366,mindy kaling dressed issa rae elle woods ali w...
98,14 people told us student loan debt would chan...
25,treated differently work disability
37,identify beloved tv characters we ll reveal wh...
303,18 celebrity instagrams probably missed last week


In [4]:
import tensorflow as tf

In [5]:
chars = tf.strings.unicode_split(text['clean_title'], input_encoding='utf-8')
chars

2022-01-13 13:21:04.619534: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


InvalidArgumentError: Value for attr 'output_encoding' of "utf-8" is not in the list of allowed values: "UTF-8", "UTF-16-BE", "UTF-32-BE"
	; NodeDef: {{node UnicodeEncode}}; Op<name=UnicodeEncode; signature=input_values:int32, input_splits:Tsplits -> output:string; attr=errors:string,default="replace",allowed=["ignore", "replace", "strict"]; attr=output_encoding:string,allowed=["UTF-8", "UTF-16-BE", "UTF-32-BE"]; attr=replacement_char:int,default=65533; attr=Tsplits:type,default=DT_INT64,allowed=[DT_INT32, DT_INT64]> [Op:UnicodeEncode]
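
The error message lists the accepted encoding names in upper case, so passing `input_encoding='UTF-8'` instead of `'utf-8'` is the likely fix (an inference from the message, not something that was run here). Independently of TensorFlow, the structure a character-level model needs can be sketched in plain Python: split each title into characters, build a vocabulary, and map characters to integer ids. The names `sample_titles`, `char_to_id`, `id_to_char`, and `encoded` are illustrative.

```python
# Two titles taken from the preprocessed samples shown earlier.
sample_titles = ["would die game thrones", "treated differently work disability"]

# Split every title into a list of single characters.
chars = [list(title) for title in sample_titles]

# Vocabulary: every distinct character, in a stable sorted order.
vocab = sorted({ch for title in chars for ch in title})
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode titles as integer sequences and check that decoding round-trips.
encoded = [[char_to_id[ch] for ch in title] for title in chars]
decoded = ["".join(id_to_char[i] for i in seq) for seq in encoded]
assert decoded == sample_titles

print(len(vocab), encoded[0][:5])
```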