# Assignment

Read text from a URL and implement the following steps:

1. Clean the text.
2. Count all the words.
3. Plot the 15 most frequent words in the text.
4. Extract information such as times, numbers, or anything else you find important.



In [1]:
#!pip install wikipedia

In [75]:
# project setup
import regex as re
import wikipedia
from collections import Counter
import pandas as pd
import nltk
from tqdm import tqdm
import numpy as np

# Enable tqdm progress_apply for pandas
tqdm.pandas()

## Read from URL

In [3]:
def read_wikipedia_page(page_title):
    try:
        page_content = wikipedia.page(page_title).content
        return page_content
    except wikipedia.exceptions.PageError:
        print("Page not found.")
        return None
    except wikipedia.exceptions.DisambiguationError as e:
        print("Disambiguation Error: ", e.options)
        return None


In [4]:
# Example usage
page_title = "Dookie"
page_content = read_wikipedia_page(page_title)
print(page_content)


Dookie is the third studio album by the American rock band Green Day, released on February 1, 1994, by Reprise Records. The band's major label debut and first collaboration with producer Rob Cavallo, it was recorded in mid-1993 at Fantasy Studios in Berkeley, California. Written mostly by frontman and guitarist Billie Joe Armstrong, the album is largely based on his personal experiences and includes themes such as boredom, anxiety, relationships, and sexuality. It was promoted with four singles: "Longview", "Basket Case", a re-recorded version of "Welcome to Paradise" (which originally appeared on the band's second studio album, 1991's Kerplunk), and "When I Come Around".
After several years of grunge's dominance in popular music, Dookie brought a livelier, more melodic rock sound to the mainstream and propelled Green Day to worldwide fame. Considered one of the defining albums of the 1990s and of punk rock in general, it was also pivotal in solidifying the genre's mainstream popularit

## Get Word Count

In [5]:
def tokenize(text):
    return re.findall(r'[\w-]*\p{L}[\w-]*', text)
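
The pattern in `tokenize` keeps only tokens that contain at least one letter, so bare numbers such as `1994` are dropped while hyphenated tokens such as `mid-1993` survive. The sketch below illustrates that behaviour. It assumes the standard-library `re` module, where `[^\W\d_]` stands in for `\p{L}` (which needs the third-party `regex` package); the helper name `tokenize_stdlib` and the sample sentence are made up for illustration.

```python
import re  # standard library; the notebook itself uses the third-party `regex` package

# Assumption: [^\W\d_] ("a word character that is not a digit or underscore")
# approximates \p{L} ("any Unicode letter") well enough for this demo.
LETTER_TOKEN = re.compile(r'[\w-]*[^\W\d_][\w-]*')

def tokenize_stdlib(text):
    """Return the tokens that contain at least one letter."""
    return LETTER_TOKEN.findall(text)

sample = "Released on February 1, 1994, it was recorded in mid-1993."
sample_tokens = tokenize_stdlib(sample)
print(sample_tokens)
```

Because purely numeric tokens never match, years and other numbers have to be extracted from the raw text rather than from the token list.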

In [6]:
# tokenize the page content
tokens = tokenize(page_content)

In [7]:
print(f'There are {len(tokens)} words in the text')

There are 5210 words in the text


In [8]:
# load stopwords from the nltk package
# (run nltk.download('stopwords') once if the corpus is not installed yet)
stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stop(tokens):
    return [t for t in tokens if t.lower() not in stopwords]

In [9]:
# remove any stopwords from the tokens
tokens = remove_stop(tokens)

In [10]:
print(f'There are {len(tokens)} words once stopwords are removed')

There are 2991 words once stopwords are removed


In [11]:
# convert tokens list to dataframe
df = pd.DataFrame(tokens, columns = ['tokens'])
df.head()

Unnamed: 0,tokens
0,Dookie
1,third
2,studio
3,album
4,American


## Plot the 15 most frequent words

In [12]:
counter = Counter(df['tokens'])
df_freq = pd.DataFrame.from_dict(counter, orient='index', columns=['Count'])
df_freq = df_freq.sort_values('Count', ascending = False)
df_freq.head(15)

Unnamed: 0,Count
album,63
band,62
Armstrong,44
Dookie,38
song,26
Green,21
Day,21
punk,20
later,15
music,14
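
The cell above tabulates the frequencies but does not draw a chart. One possible way to plot them is sketched here, assuming `matplotlib` is installed; the helper names `top_words` and `plot_top_words` and the toy token list are illustrative and not part of the notebook.

```python
from collections import Counter

def top_words(tokens, n=15):
    """Return the n most common tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)

def plot_top_words(pairs, title="Most frequent words"):
    """Draw a horizontal bar chart from (word, count) pairs."""
    # Imported here so the counting helper works even without matplotlib installed.
    import matplotlib.pyplot as plt

    words = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(words, counts)
    ax.invert_yaxis()  # put the most frequent word at the top
    ax.set_xlabel("Count")
    ax.set_title(title)
    fig.tight_layout()
    return fig

# Toy tokens for illustration only; in the notebook, pass the real `tokens` list.
demo_tokens = ["album", "band", "album", "punk", "band", "album"]
demo_top = top_words(demo_tokens, n=2)
print(demo_top)
```

In the notebook, calling `plot_top_words(top_words(tokens))` in a cell should render the chart inline.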


## Extract some information such as time, numbers, or any other information you find important

In [15]:
def find_years(text):
    # pattern matching the years 1900 to 2030 as whole words
    pattern = r'\b(19\d{2}|20[012]\d|2030)\b'
    years = re.findall(pattern, text)
    return years


In [20]:
# Find the years in the first 100 characters of the text
years_found = find_years(page_content[0:100])
print("Years found in the text:", years_found)

Years found in the text: ['1994']
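
Running the same pattern over a whole text and counting the matches gives a simple year-frequency summary. The sketch below uses the standard-library `re` module (the pattern needs no `regex`-only features) and a made-up sample string; the names `sample` and `year_counts` are illustrative.

```python
import re
from collections import Counter

def find_years(text):
    # same pattern as the notebook cell: years 1900 to 2030 as whole words
    return re.findall(r'\b(19\d{2}|20[012]\d|2030)\b', text)

# Synthetic sample text, written only to exercise the pattern.
sample = ("Formed in 1987. Albums followed in 1991 and 1994, recorded in mid-1993. "
          "The 1994 release has catalogue number 10000; a 2031 reissue is not matched.")

years = find_years(sample)
year_counts = Counter(years)
print(year_counts.most_common())
```

The word boundaries keep `10000` from matching, and `2031` falls outside the range the pattern covers. To hold the result in a dataframe, something like `pd.DataFrame(year_counts.most_common(), columns=['year', 'count'])` should work.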


# Part 2
1. Put the extracted information in a dataframe.
2. Can you find any insight or correlation among the words? Show the most frequent 2-grams and 3-grams.
3. Can you produce weights for all the words using TF-IDF?
4. What are the frequent words in each document, and what are their TF-IDF values?

In [26]:
def ngrams(tokens, n=2, sep=' '):
    return [sep.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])]

In [27]:
# redefinition of ngrams: additionally drops any n-gram that contains a stopword
def ngrams(tokens, n=2, sep=' ', stopwords=set()):
    return [sep.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])
            if not any(t in stopwords for t in ngram)]
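
The `zip(*[tokens[i:] ...])` idiom can be hard to read, so here is a small worked example on a toy token list (the list and variable names are illustrative). It also contrasts filtering n-grams by stopword with removing stopwords before building n-grams, which is what happens when `ngrams` is applied to the already-filtered `tokens`.

```python
def ngrams(tokens, n=2, sep=' ', stopwords=frozenset()):
    # tokens[i:] for i in range(n) gives n copies of the list, each shifted one
    # position further; zip stops at the shortest copy, yielding sliding windows.
    shifted = [tokens[i:] for i in range(n)]
    return [sep.join(window) for window in zip(*shifted)
            if not any(t in stopwords for t in window)]

demo = ['welcome', 'to', 'paradise', 'by', 'green', 'day']
stop = {'to', 'by'}

bigrams = ngrams(demo)                    # every adjacent pair
trigrams = ngrams(demo, n=3)              # every adjacent triple
filtered = ngrams(demo, stopwords=stop)   # pairs containing a stopword are dropped
# Removing stopwords first joins words that were not adjacent in the text.
prefiltered = ngrams([t for t in demo if t not in stop])

print(bigrams)
print(trigrams)
print(filtered)
print(prefiltered)
```

The last variant explains entries such as `Welcome Paradise` in the bigram table: the stopword between the two words was removed before the n-grams were built.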

In [31]:
counter = Counter(ngrams(tokens))
df_ngram2 = pd.DataFrame.from_dict(counter, orient='index', columns=['Count'])
df_ngram2 = df_ngram2.sort_values('Count', ascending = False)
df_ngram2.head(15)

Unnamed: 0,Count
Green Day,21
Basket Case,8
punk rock,8
Welcome Paradise,7
Come Around,5
band members,5
music video,4
Billie Joe,4
Joe Armstrong,4
New York,4


In [32]:
counter = Counter(ngrams(tokens, n=3))
df_ngram3 = pd.DataFrame.from_dict(counter, orient='index', columns=['Count'])
df_ngram3 = df_ngram3.sort_values('Count', ascending = False)
df_ngram3.head(15)

Unnamed: 0,Count
Billie Joe Armstrong,4
Welcome Paradise third,2
greatest punk rock,2
one greatest punk,2
album final single,2
originally appeared band,2
appeared band second,2
second studio album,2
studio album Kerplunk,2
first single Longview,2


## IDF

In [57]:
def compute_idf(df, column='tokens', preprocess=None, min_df=2):

    def update(doc):
        tokens = doc if preprocess is None else preprocess(doc)
        counter.update(set(tokens))

    # count tokens
    counter = Counter()
    df[column].progress_map(update)

    # create data frame and compute idf
    idf_df = pd.DataFrame.from_dict(counter, orient='index', columns=['df'])
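    # .copy() makes the filtered frame independent, so adding the 'idf' column
    # below does not trigger pandas' SettingWithCopyWarning
    idf_df = idf_df.query('df >= @min_df').copy()
    idf_df['idf'] = np.log(len(df)/idf_df['df'])+0.1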
    idf_df.index.name = 'token'
    return idf_df

In [58]:
def count_words(df, column='tokens', preprocess=None, min_freq=2):

    # process tokens and update counter
    def update(doc):
        tokens = doc if preprocess is None else preprocess(doc)
        counter.update(tokens)

    # create counter and run through all data
    counter = Counter()
    df[column].progress_map(update)

    # transform counter into data frame
    freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['freq'])
    freq_df = freq_df.query('freq >= @min_freq')
    freq_df.index.name = 'token'
    
    return freq_df.sort_values('freq', ascending=False)
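
`compute_idf` and `count_words` supply the two ingredients of a TF-IDF weight. One common way to combine them, raw term count multiplied by IDF, is sketched below with plain dictionaries and hypothetical toy documents. Unlike `compute_idf`, this sketch applies no `min_df` filter, and all names in it are illustrative.

```python
import math
from collections import Counter

# Toy pre-tokenized documents (hypothetical data, not the Wikipedia pages).
docs = {
    'doc_a': ['punk', 'band', 'album', 'punk'],
    'doc_b': ['rap', 'album', 'rapper'],
    'doc_c': ['rap', 'film', 'album'],
}

# Document frequency: the number of documents that contain each token.
doc_freq = Counter()
for tokens in docs.values():
    doc_freq.update(set(tokens))

# Same smoothed IDF as compute_idf: ln(N / df) + 0.1
n_docs = len(docs)
idf = {tok: math.log(n_docs / df) + 0.1 for tok, df in doc_freq.items()}

# Per-document TF-IDF: raw term count times IDF.
tfidf = {
    name: {tok: freq * idf[tok] for tok, freq in Counter(tokens).items()}
    for name, tokens in docs.items()
}

for name, weights in tfidf.items():
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print(name, [(tok, round(w, 3)) for tok, w in top])
```

A token that appears in every document (`album` here) keeps only the 0.1 offset as its IDF, so frequent but uninformative words are weighted down relative to document-specific ones.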

In [68]:
pipeline = [str.lower, tokenize, remove_stop]

def prepare(text, pipeline):
    tokens = text
    for transform in pipeline:
        tokens = transform(tokens)
    return tokens
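
`prepare` simply feeds the output of each step into the next, so the order of the pipeline matters: `str.lower` expects a string, while the stopword filter expects a token list. A minimal sketch follows, using stand-in steps so it runs without `nltk` or `regex`; `demo_tokenize`, `demo_remove_stop`, and `DEMO_STOPWORDS` are assumptions made for this example.

```python
import re

# Stand-ins (assumptions) for the notebook's tokenize / remove_stop.
DEMO_STOPWORDS = {'is', 'the', 'by'}

def demo_tokenize(text):
    return re.findall(r'[a-z]+', text)

def demo_remove_stop(tokens):
    return [t for t in tokens if t not in DEMO_STOPWORDS]

def prepare(text, pipeline):
    tokens = text
    for transform in pipeline:
        tokens = transform(tokens)  # output of one step is input to the next
    return tokens

demo_pipeline = [str.lower, demo_tokenize, demo_remove_stop]
result = prepare("Dookie is the third studio album by Green Day", demo_pipeline)
print(result)
```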

In [81]:
# Example usage
page_title = "Green Day"
green_day_page_content = read_wikipedia_page(page_title)
print(green_day_page_content)


Green Day is an American rock band formed in the East Bay of California in 1987 by lead vocalist and guitarist Billie Joe Armstrong, together with bassist and backing vocalist Mike Dirnt. For most of the band's career, they have been a power trio with drummer Tré Cool, who replaced John Kiffmeyer in 1990 before the recording of the band's second studio album, Kerplunk (1991). Before taking its current name in 1989, Green Day was called Blood Rage, then Sweet Children. They were part of the late 1980s/early 1990s Bay Area punk scene that emerged from the 924 Gilman Street club in Berkeley, California. The band's early releases were with the independent record label Lookout! Records. In 1994, their major-label debut Dookie, released through Reprise Records, became a breakout success and eventually shipped over 10 million copies in the U.S. Alongside fellow California punk bands Bad Religion, the Offspring, Rancid, NOFX, Pennywise and Social Distortion, Green Day is credited with populari

In [85]:
# Example usage
page_title = "Eminem"
eminem_page_content = read_wikipedia_page(page_title)
print(eminem_page_content)


Marshall Bruce Mathers III (born October 17, 1972), known professionally as Eminem, is an American rapper. He is credited with popularizing hip hop in Middle America and is often regarded as one of the greatest rappers of all time. His global success and acclaimed works are widely regarded as having broken racial barriers for the acceptance of white rappers in popular music. While much of his transgressive work during the late 1990s and early 2000s made him a controversial figure, he came to be a representation of popular angst of the American underclass and has been cited as an influence by and upon many artists working in various genres.
After the release of his debut album Infinite (1996) and the extended play Slim Shady EP (1997), Eminem signed with Dr. Dre's Aftermath Entertainment and subsequently achieved mainstream popularity in 1999 with The Slim Shady LP. His next two releases, The Marshall Mathers LP (2000) and The Eminem Show (2002), were worldwide successes and were both n

In [86]:
# Example usage
page_title = "Ice Cube"
cube_page_content = read_wikipedia_page(page_title)
print(cube_page_content)


O'Shea Jackson Sr. (born June 15, 1969), better known as Ice Cube, is an American rapper, songwriter, actor, and film producer. His lyrics on N.W.A's 1988 album Straight Outta Compton contributed to gangsta rap's widespread popularity, and his political rap solo albums AmeriKKKa's Most Wanted (1990), Death Certificate (1991), and The Predator (1992) were all critically and commercially successful. He was inducted into the Rock and Roll Hall of Fame as a member of N.W.A in 2016.A native of Los Angeles, Ice Cube formed his first rap group called C.I.A. in 1986. In 1987, with Eazy-E and Dr. Dre, he formed the gangsta rap group N.W.A. As its lead rapper, he wrote some of Dre's and most of Eazy's lyrics on Straight Outta Compton, a landmark album that shaped West Coast hip hop's early identity and helped differentiate it from East Coast rap. N.W.A was also known for their violent lyrics, threatening to attack abusive police which stirred controversy. After a monetary dispute over the group'

In [90]:
# convert the 3 wikipedia page texts into a dataframe
df = pd.DataFrame({'text': [green_day_page_content, eminem_page_content, cube_page_content]})
df

Unnamed: 0,text
0,Green Day is an American rock band formed in t...
1,"Marshall Bruce Mathers III (born October 17, 1..."
2,"O'Shea Jackson Sr. (born June 15, 1969), bette..."


In [92]:
# apply the pipeline to the text column
# this lowercases the text, extracts tokens, and removes stopwords
df['tokens'] = df['text'].progress_apply(prepare, pipeline=pipeline)
df

100%|███████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 138.24it/s]


Unnamed: 0,text,tokens
0,Green Day is an American rock band formed in t...,"[green, day, american, rock, band, formed, eas..."
1,"Marshall Bruce Mathers III (born October 17, 1...","[marshall, bruce, mathers, iii, born, october,..."
2,"O'Shea Jackson Sr. (born June 15, 1969), bette...","[shea, jackson, sr, born, june, better, known,..."


In [93]:
idf_df = compute_idf(df)
idf_df

100%|██████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1500.82it/s]


token,df,idf
fifth,3,0.100000
began,3,0.100000
upcoming,2,0.505465
leading,2,0.505465
listen,2,0.505465
...,...,...
collaborating,2,0.505465
dark,2,0.505465
row,2,0.505465
pull,2,0.505465
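
The two distinct values in the `idf` column follow directly from the formula in `compute_idf`. With $N = 3$ documents and the natural logarithm (which is what `np.log` computes), a token found in all three documents and a token found in two of them get:

```latex
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)} + 0.1
\qquad
\begin{aligned}
\mathrm{df}(t) = 3 &: \quad \ln\tfrac{3}{3} + 0.1 = 0 + 0.1 = 0.1 \\
\mathrm{df}(t) = 2 &: \quad \ln\tfrac{3}{2} + 0.1 \approx 0.405465 + 0.1 = 0.505465
\end{aligned}
```

Tokens that occur in only one document would have the highest IDF, $\ln 3 + 0.1 \approx 1.198612$, but the default `min_df=2` filters them out before the column is computed.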
