## Gathering data from text files

In [2]:
import glob
import pandas as pd

We use the [glob library](https://docs.python.org/3/library/glob.html), which makes it simple to open files that share a similar path structure (like our folder of Roger Ebert review text files).

- [`glob.glob(pathname, *, recursive=False)`](https://docs.python.org/3/library/glob.html#glob.glob) - Returns a possibly empty list of path names that match `pathname`, which must be a string containing a path specification. `pathname` can be either absolute (like `/usr/src/Python-1.5/Makefile`) or relative (like `../../Tools/*/*.gif`), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).
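
To illustrate how the wildcard pattern selects files, here is a minimal sketch. The helper name `list_review_files` and the file names are invented for this example; it builds a throwaway folder so that it runs without the `ebert_reviews` directory.

```python
import glob
import os
import tempfile


def list_review_files(folder):
    """Return the sorted paths of all .txt files directly inside `folder`."""
    pattern = os.path.join(folder, '*.txt')  # '*' matches any file name
    return sorted(glob.glob(pattern))


# Build a temporary folder with two .txt files and one non-matching file.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('b_review.txt', 'a_review.txt', 'notes.md'):
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as fh:
            fh.write('placeholder\n')
    matched = [os.path.basename(p) for p in list_review_files(tmp_dir)]

print(matched)  # ['a_review.txt', 'b_review.txt']
```

Only the two `.txt` files match; `notes.md` is skipped because it does not fit the `*.txt` pattern.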

In [6]:
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding = 'utf-8') as fh:
        print(fh.readline())
        break

The Wizard of Oz (1939)



To remove the blank line above, which is caused by the trailing newline character, we slice off the last character of the line.

In [7]:
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding = 'utf-8') as fh:
        print(fh.readline()[:-1])
        break

The Wizard of Oz (1939)
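
One caveat with `[:-1]`: it removes the last character whatever that character is. A possible alternative, not used in this notebook, is `str.rstrip('\n')`, which only removes trailing newline characters. The sketch below compares the two on made-up strings.

```python
# Two made-up first lines: one ends with a newline, one does not.
line_with_newline = 'The Wizard of Oz (1939)\n'
line_without_newline = 'The Wizard of Oz (1939)'

# Slicing always drops the final character.
sliced_ok = line_with_newline[:-1]
sliced_bad = line_without_newline[:-1]

# rstrip('\n') drops trailing newlines only, so both inputs give the same title.
stripped_a = line_with_newline.rstrip('\n')
stripped_b = line_without_newline.rstrip('\n')

print(repr(sliced_ok))   # 'The Wizard of Oz (1939)'
print(repr(sliced_bad))  # 'The Wizard of Oz (1939'
print(stripped_a == stripped_b)  # True
```

For files where every line ends with a newline, as the output above suggests is the case here, both approaches give the same result.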


That gives us the title. Next, we grab the review URL and the full text.

In [9]:
# url
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding = 'utf-8') as fh:
        title = fh.readline()[:-1]
        review_url = fh.readline()[:-1]
        print(review_url)
        break

http://www.rogerebert.com/reviews/great-movie-the-wizard-of-oz-1939


In [12]:
# full text
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding = 'utf-8') as fh:
        title = fh.readline()[:-1]
        review_url = fh.readline()[:-1]
        full_text = fh.read()
        #print(full_text)
        break
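
The cells above work because a file handle remembers its position: each `readline()` consumes one line, and `read()` returns whatever is left. Here is a small sketch of that behavior using `io.StringIO` as a stand-in for a review file; the sample title, URL, and text are invented.

```python
import io

# Invented stand-in for a review file: title, URL, then the review body.
sample = io.StringIO(
    'Example Movie (2000)\n'
    'http://example.com/reviews/example-movie-2000\n'
    'First paragraph of the review.\n'
    '\n'
    'Second paragraph of the review.\n'
)

title = sample.readline()[:-1]       # consumes line 1
review_url = sample.readline()[:-1]  # consumes line 2
full_text = sample.read()            # everything after line 2

print(title)
print(review_url)
print(repr(full_text))
```

Note that `full_text` keeps its internal newlines, which is why `\n` sequences can appear inside the review text later on.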

Now we gather all of the data and put it in a DataFrame.

In [14]:
# Collect one dictionary per review file
df_list = []
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding='utf-8') as fh:
        title = fh.readline()[:-1]       # first line: movie title
        review_url = fh.readline()[:-1]  # second line: review URL
        full_text = fh.read()            # remainder: review text
        df_list.append({'title': title,
                        'review_url': review_url,
                        'full_text': full_text})
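
The Python documentation notes that whether `glob.glob` results are sorted depends on the file system, so the row order can differ between machines. If a reproducible order matters, one option is to wrap the call in `sorted()`. The sketch below shows this with a hypothetical helper, `read_reviews`, and invented sample files; it also uses `rstrip('\n')` in place of `[:-1]`, which is a substitution and not the notebook's approach.

```python
import glob
import os
import tempfile


def read_reviews(folder):
    """Parse every .txt file in `folder` into a list of dictionaries."""
    rows = []
    # sorted() makes the row order independent of the file system.
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        with open(path, encoding='utf-8') as fh:
            rows.append({
                'title': fh.readline().rstrip('\n'),
                'review_url': fh.readline().rstrip('\n'),
                'full_text': fh.read(),
            })
    return rows


# Invented sample files, written in reverse alphabetical order on purpose.
with tempfile.TemporaryDirectory() as tmp_dir:
    samples = {
        '2_second.txt': 'Second Movie (2001)\nhttp://example.com/second\nText B.\n',
        '1_first.txt': 'First Movie (2000)\nhttp://example.com/first\nText A.\n',
    }
    for name, content in samples.items():
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as fh:
            fh.write(content)
    rows = read_reviews(tmp_dir)

print([row['title'] for row in rows])  # ['First Movie (2000)', 'Second Movie (2001)']
```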

In [15]:
len(df_list)

88

In [17]:
df = pd.DataFrame(df_list, columns = ['title', 'review_url', 'full_text'])  # columns given so that they are arranged as desired
df.head()

| | title | review_url | full_text |
|---|---|---|---|
| 0 | The Wizard of Oz (1939) | http://www.rogerebert.com/reviews/great-movie-... | As a child I simply did not notice whether a m... |
| 1 | Metropolis (1927) | http://www.rogerebert.com/reviews/great-movie-... | The opening shots of the restored “Metropolis”... |
| 2 | Battleship Potemkin (1925) | http://www.rogerebert.com/reviews/great-movie-... | "The Battleship Potemkin” has been so famous f... |
| 3 | E.T. The Extra-Terrestrial (1982) | http://www.rogerebert.com/reviews/great-movie-... | Dear Raven and Emil:\n\nSunday we sat on the b... |
| 4 | Modern Times (1936) | http://www.rogerebert.com/reviews/modern-times... | A lot of movies are said to be timeless, but s... |
