In [1]:
import numpy as np
import pandas as pd
import json

### From `data/books/README`:

This README describes data in the CMU Book Summary Corpus, a collection of 16,559 book plot summaries extracted from Wikipedia, along with aligned metadata from Freebase, including book author, title, and genre.

All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

# DATA

booksummaries.txt

Plot summaries of 16,559 books extracted from the November 2, 2012 dump of English-language Wikipedia.  Tab-separated; columns:

1. Wikipedia article ID
2. Freebase ID
3. Book title
4. Author
5. Publication date
6. Book genres (Freebase ID:name tuples)
7. Plot summary


In [2]:
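def parse_json(s):
    """Return the first value of a JSON object string (the first genre name), or None."""
    try:
        return list(json.loads(s).values())[0]
    except Exception:
        # Missing (NaN), malformed, or empty fields have no usable genre.
        return None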
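
To make the behavior of `parse_json` concrete, here is a small sketch on made-up inputs. The Freebase IDs in the sample string are placeholders invented for illustration, not values from the corpus, and the function body is repeated only so the snippet runs on its own.

```python
import json

def parse_json(s):
    """Return the first value of a JSON object string, or None if it cannot be parsed."""
    try:
        return list(json.loads(s).values())[0]
    except Exception:
        return None

# Hypothetical genre field in the "Freebase ID: name" format the README describes.
sample = '{"/m/aaa": "Science Fiction", "/m/bbb": "Speculative fiction"}'

first_genre = parse_json(sample)      # only the first listed genre is kept
missing = parse_json(float("nan"))    # a missing field arrives as NaN, not a string
empty = parse_json("{}")              # an empty object has no first value
```

Note that any additional genres on a row are discarded, so each item ends up with a single genre label.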

In [3]:
book_df = pd.read_csv('data/books/booksummaries.txt', delimiter='\t', header=None,
                      names=['wid', 'fbid', 'title', 'author', 'pubdate', 'genre', 'summary'])
book_df['genre'] = book_df['genre'].apply(parse_json)
# Keep only rows where every field we need is present.
books = book_df.dropna(subset=['wid', 'title', 'author', 'pubdate', 'genre', 'summary'])

In [4]:
np.random.seed(4448)
books = books.sample(500)[['wid', 'title', 'author', 'pubdate', 'genre', 'summary']]
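
The seed-then-sample pattern can also be written with the `random_state` argument of `DataFrame.sample`, which keeps the seed local to the call. The sketch below uses an invented toy frame and only shows that repeated calls agree with each other; it makes no claim about selecting the same rows as the `np.random.seed` version.

```python
import pandas as pd

# Toy stand-in for the filtered book table (invented data).
toy = pd.DataFrame({"wid": range(100), "title": [f"title {i}" for i in range(100)]})

# Passing random_state makes the draw reproducible without touching NumPy's global seed.
first = toy.sample(10, random_state=4448)
second = toy.sample(10, random_state=4448)

same_rows = first.index.equals(second.index)
```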

In [5]:
books.head()

| | wid | title | author | pubdate | genre | summary |
|---|---|---|---|---|---|---|
| 14643 | 24930961 | One in Three Hundred | J. T. McIntosh | 1953 | Science Fiction | Set in the near future when a scientific prin... |
| 5416 | 4296173 | Tribulation Force | Tim LaHaye | 1996-10 | Science Fiction | Rayford Steele, Chloe Steele, Buck Williams a... |
| 5752 | 4649413 | White Mughals | William Dalrymple | 2002-03-29 | History | The book is a work of social history about th... |
| 16080 | 31672195 | A Taste of Blackberries | Doris Buchanan Smith | 1973-05 | Children's literature | As told from the point of view of the unnamed... |
| 554 | 189018 | Elbow Room | Daniel Dennett | 1984 | Philosophy | A major task taken on by Dennett in Elbow Roo... |


### From `data/movies/README.txt`:

This README describes data in the CMU Movie Summary Corpus, a collection of 42,306 movie plot summaries and metadata at both the movie level (including box office revenues, genre and date of release) and character level (including gender and estimated age).  This data supports work in the following paper:

David Bamman, Brendan O'Connor and Noah Smith, "Learning Latent Personas of Film Characters," in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013.

All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

# DATA

1. plot_summaries.txt.gz [29 M] 

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia.  Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

...

# METADATA

3. movie.metadata.tsv.gz [3.4 M]


Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase.  Tab-separated; columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie name
4. Movie release date
5. Movie box office revenue
6. Movie runtime
7. Movie languages (Freebase ID:name tuples)
8. Movie countries (Freebase ID:name tuples)
9. Movie genres (Freebase ID:name tuples)

In [6]:
def get_summaries(wid, summary_df):
    """Look up the plot summary for a Wikipedia ID; return None if there is none."""
    try:
        return summary_df.loc[wid]['summary']
    except Exception:
        return None

In [7]:
movie_df = pd.read_csv('data/movies/movie.metadata.tsv', delimiter='\t', header=None,
                       names=['wid', 'fbid', 'title', 'releasedate', 'boxofficerev', 
                              'runtime', 'langs', 'countries', 'genre'])
movie_df['genre'] = movie_df['genre'].apply(parse_json)
summary_df = pd.read_csv('data/movies/plot_summaries.txt', delimiter='\t', header=None,
                         names=['wid', 'summary'])
summary_df.index = summary_df['wid']
movie_df['summary'] = movie_df['wid'].apply(lambda x:get_summaries(x, summary_df))
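
As an alternative to the row-by-row lookup, the same attachment of summaries to metadata can be sketched with `Series.map`. This is a minimal illustration on invented toy frames, and it assumes each `wid` appears at most once in the summaries (mapping against a Series with a duplicated index raises an error).

```python
import pandas as pd

# Toy stand-ins for the metadata and summary tables (invented data).
meta = pd.DataFrame({"wid": [1, 2, 3], "title": ["A", "B", "C"]})
plots = pd.DataFrame({"wid": [1, 3], "summary": ["plot of A", "plot of C"]})

# Build a wid -> summary lookup and map it over the metadata IDs.
lookup = plots.set_index("wid")["summary"]
meta["summary"] = meta["wid"].map(lookup)

# Movies without a summary end up as missing values and can be filtered later.
n_missing = int(meta["summary"].isna().sum())
```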

In [8]:
# Keep only rows where every field we need is present.
movies = movie_df.dropna(subset=['wid', 'title', 'releasedate', 'runtime',
                                 'langs', 'countries', 'genre', 'summary'])

In [9]:
np.random.seed(4448)
movies = movies.sample(200)[['wid', 'title', 'releasedate', 'runtime', 'genre', 'summary']]

In [10]:
movies.head()

| | wid | title | releasedate | runtime | genre | summary |
|---|---|---|---|---|---|---|
| 17039 | 3350098 | Sam Whiskey | 1967-04-28 | 97.0 | Action/Adventure | The husband of Laura Breckenridge has stolen $... |
| 37866 | 803864 | Sitcom | 1998 | 80.0 | LGBT | The patriarch of a seemingly normal nuclear fa... |
| 40605 | 10791882 | Ilamai Oonjal Aadukirathu | 1978 | 141.0 | Romance Film | Murli and Prabhu are owner and manager respe... |
| 76122 | 17607081 | An Unforgettable Summer | 1994 | 94.0 | Romance Film | The film's plot, which develops as a flashback... |
| 35719 | 2203406 | Fright Night II | 1988 | 108.0 | Horror | Fright Night Part II begins after the events i... |


### Notes: composing item table

### Check whether `wid` can be used as `item_id`:

In [11]:
assert len(set.union(set(movies['wid'].values), set(books['wid'].values))) \
       == (len(movies['wid'].values)+len(books['wid'].values))

assert not movies['wid'].isna().any()
assert not books['wid'].isna().any()
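# Check for any integer dtype; comparing against `int` can fail where the
# platform default integer width differs from the column's (e.g. int32 vs int64).
assert pd.api.types.is_integer_dtype(movies['wid'])
assert pd.api.types.is_integer_dtype(books['wid'])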
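
The checks above amount to three conditions: no ID is shared between the two tables, no ID repeats within a table, and the IDs are integers. A minimal sketch of those conditions, using invented toy frames rather than the sampled data:

```python
import pandas as pd

# Toy stand-ins for the sampled tables (invented IDs).
toy_movies = pd.DataFrame({"wid": [10, 11, 12]})
toy_books = pd.DataFrame({"wid": [20, 21]})

# IDs that appear in both tables; an empty set means no cross-table collisions.
overlap = set(toy_movies["wid"]) & set(toy_books["wid"])

# Each table must also be free of internal duplicates.
unique_within = toy_movies["wid"].is_unique and toy_books["wid"].is_unique

# Width-independent integer check (true for int32 and int64 alike).
integer_ids = all(
    pd.api.types.is_integer_dtype(df["wid"]) for df in (toy_movies, toy_books)
)
```

Splitting the conditions this way makes it easier to see which one failed, at the cost of a few more lines than the single union-length comparison.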

### Maximum length of movie/book titles:

In [12]:
movies['title'].apply(len).max(), books['title'].apply(len).max()

(40, 111)

Cap `title` at 150 characters to be safe.

### Maximum length of book author:

In [13]:
books['author'].apply(len).max()

43

Cap `author` at 60 characters to be safe.

### Maximum length of book/movie genre:

In [14]:
movies['genre'].apply(len).max(), books['genre'].apply(len).max()

(20, 24)

Cap `genre` at 30 characters for both books and movies to be safe.

### Maximum length of book/movie summary:

In [15]:
movies['summary'].apply(len).max(), books['summary'].apply(len).max()

(13978, 16214)

Summaries run past 16,000 characters. Should this column use a text data type rather than a length-limited string?
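
The per-column length checks above can be gathered into one helper. This is a sketch only: `max_lengths` is a hypothetical name introduced here, and the toy frame reuses two titles and genres from the sample output rather than the full data.

```python
import pandas as pd

def max_lengths(df, columns):
    """Return the longest string length observed in each of the given columns."""
    return {col: int(df[col].str.len().max()) for col in columns}

# Toy frame with two rows borrowed from the book sample shown earlier.
toy = pd.DataFrame({
    "title": ["Elbow Room", "White Mughals"],
    "genre": ["Philosophy", "History"],
})

observed = max_lengths(toy, ["title", "genre"])
```

The observed maxima only describe the current sample, which is why the limits chosen in these notes leave some headroom above them.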

### Final adjustments to dataframes

In [16]:
movies = movies.rename(columns = {"wid": "item_id", "releasedate": "release_date"})
books = books.rename(columns = {"wid": "item_id", "pubdate": "pub_date"})

In [17]:
movies.head()

| | item_id | title | release_date | runtime | genre | summary |
|---|---|---|---|---|---|---|
| 17039 | 3350098 | Sam Whiskey | 1967-04-28 | 97.0 | Action/Adventure | The husband of Laura Breckenridge has stolen $... |
| 37866 | 803864 | Sitcom | 1998 | 80.0 | LGBT | The patriarch of a seemingly normal nuclear fa... |
| 40605 | 10791882 | Ilamai Oonjal Aadukirathu | 1978 | 141.0 | Romance Film | Murli and Prabhu are owner and manager respe... |
| 76122 | 17607081 | An Unforgettable Summer | 1994 | 94.0 | Romance Film | The film's plot, which develops as a flashback... |
| 35719 | 2203406 | Fright Night II | 1988 | 108.0 | Horror | Fright Night Part II begins after the events i... |


In [18]:
books.head()

| | item_id | title | author | pub_date | genre | summary |
|---|---|---|---|---|---|---|
| 14643 | 24930961 | One in Three Hundred | J. T. McIntosh | 1953 | Science Fiction | Set in the near future when a scientific prin... |
| 5416 | 4296173 | Tribulation Force | Tim LaHaye | 1996-10 | Science Fiction | Rayford Steele, Chloe Steele, Buck Williams a... |
| 5752 | 4649413 | White Mughals | William Dalrymple | 2002-03-29 | History | The book is a work of social history about th... |
| 16080 | 31672195 | A Taste of Blackberries | Doris Buchanan Smith | 1973-05 | Children's literature | As told from the point of view of the unnamed... |
| 554 | 189018 | Elbow Room | Daniel Dennett | 1984 | Philosophy | A major task taken on by Dennett in Elbow Roo... |


### Save dataframes

In [19]:
# index=False drops the original row labels; the out/ directory must already exist.
movies.to_csv("out/movies.csv", index=False)
books.to_csv("out/books.csv", index=False)
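
A quick way to sanity-check the saved files is to read one back and confirm that the row labels were dropped and `item_id` is still an integer column. The sketch below is an illustration on a two-row toy frame written to a temporary directory, not to the `out/` paths used above.

```python
import os
import tempfile

import pandas as pd

# Toy frame with non-default row labels, mimicking the sampled tables.
toy = pd.DataFrame(
    {"item_id": [189018, 4296173], "title": ["Elbow Room", "Tribulation Force"]},
    index=[554, 5416],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "books.csv")
    toy.to_csv(path, index=False)   # drop the original row labels
    restored = pd.read_csv(path)

columns_match = list(restored.columns) == ["item_id", "title"]
ids_are_ints = pd.api.types.is_integer_dtype(restored["item_id"])
```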