## DS Final Project
Imports and definitions

In [1]:
import pandas as pd

from textblob import TextBlob
from nltk.corpus import stopwords
from collections import defaultdict, Counter

Function to turn a description into a bag of words

In [2]:
def text_to_vector(fulltext):
    """
    Reduce each word to its base lemma (e.g., gnomes -> gnome; runs -> run),
    lowercase it, and remove English stop-words.
    """
    # Drop non-ASCII characters (Python 3: encode to bytes, then decode back to str)
    fulltext = fulltext.encode('ascii', 'ignore').decode('ascii')
    
    # Load English stop-words
    # Note: this will not work for shows that are in different language categories, e.g., Latino
    stop_words = set(stopwords.words('english'))

    # Take the base word (lemma) of each word
    bag_of_words = [i.lemma for i in TextBlob(fulltext).words if i != "'s"]

    # Convert all the words to lower case
    bag_of_words = [i.lower() for i in bag_of_words]

    # Remove all of the stop words
    bag_of_words = [word for word in bag_of_words if word not in stop_words]
        
    return bag_of_words
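
To make the steps concrete without the NLTK and TextBlob dependencies, here is a minimal standard-library sketch. The small `STOP_WORDS` set and the plural-stripping `naive_lemma` helper are illustrative stand-ins (assumptions, not the NLTK stop-word list or the WordNet lemmatizer), and `simple_bag_of_words` is a hypothetical name.

```python
import re

# Assumption: a tiny illustrative stop-word set, not NLTK's English list.
STOP_WORDS = {"a", "about", "and", "are", "is", "the", "was", "has",
              "of", "with", "his"}


def naive_lemma(word):
    """Very rough stand-in for a lemmatizer: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def simple_bag_of_words(text):
    """Lowercase, drop stop words, then reduce each remaining word."""
    # Tokens are runs of lowercase letters/digits, allowing inner ' and -.
    tokens = re.findall(r"[a-z0-9][a-z0-9'-]*", text.lower())
    # Drop possessive endings such as "bob's" -> "bob".
    tokens = [t[:-2] if t.endswith("'s") else t for t in tokens]
    # Filter stop words BEFORE reducing, so "was" never becomes "wa".
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_lemma(t) for t in tokens]


print(simple_bag_of_words("Bob runs Bob's Burgers with the help of his wife"))
```

Note that this sketch filters stop words before reducing the words, whereas `text_to_vector` lemmatizes first. Tokens such as `wa` and `ha` in the printed results further down are most likely `was` and `has` lemmatized as nouns before the stop-word filter ran; filtering first would probably avoid them.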

Remove any commas in the numbers and convert to floats for later analysis

In [3]:
df = pd.read_csv('ds_final2.csv')

def remove_comma(x):
    x = x.replace(',', '')
    return x

df[' average monthly total users (June - Sep 17) '] = df[' average monthly total users (June - Sep 17) '].apply(remove_comma).astype(float)
df.head(5)

Unnamed: 0,series id,series title,description,site channel,primary channel sub channel,average monthly total users (June - Sep 17)
0,34022,Rick and Morty,Rick and Morty is a show about a sociopathic s...,Animation and Cartoons~Primetime Animation|Teen,Animation and Cartoons~Primetime Animation,706040.0
1,46885,The Orville,From Emmy Award-winning executive producer and...,Science Fiction~Science Fiction - Space|Comedy,Science Fiction~Science Fiction - Space,627108.0
2,17424,South Park,"Underpants-stealing gnomes, a talking Christma...",Comedy|Animation and Cartoons~Primetime Animation,Comedy,575884.0
3,47614,Will & Grace (1998),Will and Grace are best friends and roommates....,Comedy~Sitcoms|,Comedy~Sitcoms,534870.0
4,15319,Bob's Burgers,Bob runs Bob's Burgers with the help of his wi...,Animation and Cartoons~Primetime Animation|Comedy,Animation and Cartoons~Primetime Animation,474963.0
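
The conversion can be checked on plain strings before applying it to the whole column. This is a minimal sketch: `parse_count` is a hypothetical helper name, and the comma-formatted sample strings are assumptions about how the raw CSV values look.

```python
def parse_count(raw):
    """Convert a string such as ' 706,040 ' to a float."""
    # Strip surrounding whitespace, then drop thousands separators.
    return float(raw.strip().replace(",", ""))


# Assumed raw formatting; the values mirror the first rows of the table.
samples = [" 706,040 ", "627,108", "1,234,567.5"]
print([parse_count(s) for s in samples])
```

`pandas.read_csv` also accepts a `thousands=','` argument, which may make the manual step unnecessary if the raw values parse cleanly as numbers.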


Create a column to contain the bag of words for descriptions

In [4]:
df['description_bow'] = df['description'].apply(text_to_vector)
df[['description', 'description_bow']].head(5)

Unnamed: 0,description,description_bow
0,Rick and Morty is a show about a sociopathic s...,"[rick, morty, show, sociopathic, scientist, dr..."
1,From Emmy Award-winning executive producer and...,"[emmy, award-winning, executive, producer, cre..."
2,"Underpants-stealing gnomes, a talking Christma...","[underpants-stealing, gnome, talking, christma..."
3,Will and Grace are best friends and roommates....,"[grace, best, friend, roommate, pal, karen, ja..."
4,Bob runs Bob's Burgers with the help of his wi...,"[bob, run, bob, burgers, help, wife, three, ki..."


Group by site channel. Within each group, find the show with the highest average monthly total users, then compare the most frequent words in its description with the most frequent words across all descriptions in the group.
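
The comparison can be tried on toy data first. The sketch below uses only the standard library; the rows in `shows` and the `summarize` helper are hypothetical, and `Counter` is used for the pooled word counts in place of a hand-built dictionary.

```python
from collections import Counter

# Hypothetical toy rows: (site channel, title, monthly users, bag of words).
shows = [
    ("Comedy", "Show A", 500.0, ["team", "spy", "team"]),
    ("Comedy", "Show B", 900.0, ["agent", "spy"]),
    ("Drama", "Show C", 300.0, ["criminal", "mind", "criminal"]),
]

# Group the rows by site channel.
groups = {}
for row in shows:
    groups.setdefault(row[0], []).append(row)


def summarize(rows):
    """Return top title, its two most common words, and the group's tied top words."""
    # Show with the highest user count in the group.
    top = max(rows, key=lambda r: r[2])
    top_words = Counter(top[3]).most_common(2)

    # Word counts pooled over every description in the group.
    pooled = Counter(word for r in rows for word in r[3])
    best = max(pooled.values())
    # All words tied for the highest pooled count, sorted for stable output.
    tied = sorted(word for word, n in pooled.items() if n == best)
    return top[1], top_words, best, tied


for channel, rows in groups.items():
    print(channel, summarize(rows))
```

Keeping `most_common(2)` as a list, rather than unpacking it into two pairs, also avoids an error when a description contains fewer than two distinct words.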

In [5]:
for i, grp in df.groupby('site channel'):
    
    # Top show in grouping
    # idxmax() returns the index label that .loc expects (argmax() is positional in current pandas)
    top_title = grp.loc[grp[' average monthly total users (June - Sep 17) '].idxmax()]
    
    # Count number of times a word is used
    word_counts = Counter(top_title['description_bow'])

    # Get top 2 most used words
    (top_1, count_1), (top_2, count_2) = word_counts.most_common(2)
    print('Site channel: {}'.format(grp['site channel'].values[0]))
    print('Number of titles: {}'.format(grp['series title'].size))
    print('Top show: {}'.format(top_title['series title']))
    print('Top word: ({} times), {}'.format(count_1, top_1))
    print('Second top word: ({} times), {}'.format(count_2, top_2))

    # Count word usage across all shows in the group
    grp_dict = defaultdict(int)
    for bow in grp['description_bow']:
        for word in bow:
            grp_dict[word] += 1

    # Invert the dictionary to find the most used words
    inverse = defaultdict(list)
    for k, v in grp_dict.items():
        inverse[v].append(k)
    max_key = max(inverse.keys())
    
    print('Content top used words: ({} times), {}'.format(max_key, ', '.join(inverse[max_key])))
    print('')

Site channel: Action and Adventure
Number of titles: 4
Top show: Matador
Top word: (1 times), spy
Second top word: (1 times), cia
Content top used words: (3 times), team, agent

Site channel: Action and Adventure|Classics
Number of titles: 1
Top show: The Saint (TV)
Top word: (2 times), inspector
Second top word: (2 times), rich
Content top used words: (2 times), inspector, rich, criminal, saint, teal, steal

Site channel: Action and Adventure|Drama
Number of titles: 1
Top show: Highlander: The Raven
Top word: (1 times), achieved
Second top word: (1 times), thief
Content top used words: (1 times), achieved, thief, immortal, street, roamed, year, derieux, immortality, preternatural, smart, evil, using, ha, true, countless, king, 1,200, cavorting, world, blade, evade, romancing, amanda

Site channel: Action and Adventure|Drama|Family
Number of titles: 1
Top show: Legend of the Seeker
Top word: (1 times), tyranny
Second top word: (1 times), begin
Content top used words: (1 times), tyranny

Content top used words: (4 times), wa, one, satoru

Site channel: Anime|Action and Adventure|Videogames
Number of titles: 1
Top show: Pokémon the Series: Ruby and Sapphire
Top word: (6 times), team
Second top word: (2 times), aqua
Content top used words: (6 times), team

Site channel: Anime|Animation and Cartoons
Number of titles: 1
Top show: Astro Boy
Top word: (1 times), repulse
Second top word: (1 times), series
Content top used words: (1 times), repulse, series, pinocchio-like, japanese, son, alien, scientist, earth, laser-firing, complete, renowned, superhero, jet-powered, boot, character, research, secret, permanently, finger, tell, uncanny, eventually, used, intended, becomes, device, invasion, hearing, boy, kept, like, publicly, originally, modeled, robot, tale, youthful, animated, deceased

Site channel: Anime|Animation and Cartoons|International~Japanese
Number of titles: 1
Top show: Attack on Titan: Junior High
Top word: (1 times), school
Second top word: (1 times), favorite

Top show: Haikyu!!
Top word: (4 times), team
Second top word: (3 times), high
Content top used words: (4 times), team

Site channel: Anime~Anime - Tournament|International~Japanese
Number of titles: 3
Top show: The Future Diary
Top word: (2 times), one
Second top word: (2 times), diary
Content top used words: (3 times), astral, yuma

Site channel: Anime~Anime - Tournament|Teen|Videogames
Number of titles: 1
Top show: Yu-Gi-Oh!
Top word: (2 times), high
Second top word: (2 times), duel
Content top used words: (2 times), high, duel, game, yugi

Site channel: Arts and Culture|Lifestyle~Travel|Family
Number of titles: 1
Top show: Rick Steves' Europe
Top word: (1 times), perfect
Second top word: (1 times), city
Content top used words: (1 times), perfect, city, television, series, steves, destination, favorite, public, off-the-beaten-path, rick, guiding, village, popular, partner, travel, european

Site channel: Classics|Comedy
Number of titles: 1
Top show: The Three Stooges Classics
Top wor


Site channel: Drama|Horror and Suspense
Number of titles: 1
Top show: Criminal Minds
Top word: (2 times), criminal
Second top word: (1 times), analyze
Content top used words: (2 times), criminal

Site channel: Drama|Horror and Suspense~Paranormal|Family
Number of titles: 1
Top show: Ghost Whisperer
Top word: (1 times), woman
Second top word: (1 times), restless
Content top used words: (1 times), woman, restless, communicates, young, dead, preventing, resolve, passing, spirit, conflict, help

Site channel: Drama|Horror and Suspense~Zombies
Number of titles: 1
Top show: Fear the Walking Dead
Top word: (2 times), walking
Second top word: (2 times), dead
Content top used words: (2 times), walking, dead

Site channel: Drama|International|International~Australian
Number of titles: 1
Top show: McLeod's Daughters
Top word: (2 times), together
Second top word: (1 times), cattle
Content top used words: (2 times), together

Site channel: Drama|International~Australian
Number of titles: 1
Top sho

Top show: Classroom of the Elite
Top word: (6 times), student
Second top word: (4 times), school
Content top used words: (6 times), student

Site channel: International~Japanese|Anime~Anime - Drama|Anime~Anime - Action Adventure
Number of titles: 2
Top show: Mobile Suit Gundam 00
Top word: (2 times), force
Second top word: (2 times), fighting
Content top used words: (3 times), alliance, force, earth

Site channel: International~Japanese|Anime~Anime - Drama|Anime~Anime - Comedy
Number of titles: 1
Top show: SAIYUKI RELOAD BLAST
Top word: (3 times), continent
Second top word: (2 times), yokai
Content top used words: (3 times), continent

Site channel: International~Japanese|Anime~Anime - Drama|Anime~Anime - Romance
Number of titles: 1
Top show: Oreshura
Top word: (3 times), eita
Second top word: (3 times), school
Content top used words: (3 times), eita, school

Site channel: International~Japanese|Anime~Anime - Drama|Anime~Anime - Science Fiction
Number of titles: 1
Top show: Space Broth