# Pythonic Data Cleaning With pandas and NumPy


## prep

In [209]:
from google.cloud import storage
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib import cm
from datetime import datetime
import glob
import os
from io import StringIO
from io import BytesIO
import json
import pickle
import six
import charset_normalizer
from wordcloud import WordCloud 
from typing import List


sns.set()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None

### Dropping Columns in a DataFrame


Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.

In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

pandas provides a handy way of removing unwanted columns or rows from a DataFrame with the drop() function. Let’s look at a simple example where we drop a number of columns from a DataFrame.

First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In this notebook the datasets live in a Google Cloud Storage bucket rather than in a local Datasets folder, so instead of passing a relative path to pd.read_csv, we download the file from the bucket and hand its decoded contents to pd.read_csv:

### storage init

In [210]:
storage_client =  storage.Client.from_service_account_json('heidless-jupyter-0-d2008100d98c.json')

BUCKET_NAME = 'python-data-cleaning-0'

bucket = storage_client.get_bucket(BUCKET_NAME)

#file_names = list(bucket.list_blobs(prefix=''))
#for name in file_names:
#    print(name.name)

### read file(s)

In [211]:

AllCSV = []
my_prefix = 'data-set-0/'
my_file = 'BL-Flickr-Images-Book.csv'
full_file = my_prefix + my_file
print(f'full_file: {full_file}')

file_names = list(bucket.list_blobs(prefix=my_prefix))
for file in file_names:
    if(file.name != my_prefix):
        if file.name == full_file:
            AllCSV.append(file.name)
            print(file.name)
AllCSV


full_file: data-set-0/BL-Flickr-Images-Book.csv
data-set-0/BL-Flickr-Images-Book.csv


['data-set-0/BL-Flickr-Images-Book.csv']

In [212]:
all_dataframes = []

for csv in AllCSV:
    blob = bucket.get_blob(csv)
    if blob is not None and blob.exists(storage_client):
        bt = blob.download_as_string()  # raw bytes of the CSV file
        # ISO-8859-1 maps every byte to a character, so decoding never fails,
        # but any UTF-8 punctuation comes out garbled (note the 'â' in the Title output).
        s = StringIO(str(bt, 'ISO-8859-1'))
        df = pd.read_csv(s)  # the text is already decoded, so no encoding argument is needed
        all_dataframes.append(df)
        print(csv)

all_dataframes[0].head()  # the single DataFrame loaded above

data-set-0/BL-Flickr-Images-Book.csv


Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of âAll for ...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.
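
The download-and-parse step above can be tried without a bucket. The sketch below is only an illustration: the one-row CSV is made up, and it assumes the real file is UTF-8 encoded, which would be consistent with the stray â in the Title column above. It shows pd.read_csv reading from an in-memory BytesIO buffer under two different encoding values.

```python
from io import BytesIO

import pandas as pd

# Hypothetical stand-in for the bytes returned by blob.download_as_string()
raw = 'Identifier,Title\n1,By the author of “All for Greed”\n'.encode('utf-8')

# Decoding with the encoding the bytes were written in keeps the curly quotes
df_utf8 = pd.read_csv(BytesIO(raw), encoding='utf-8')

# ISO-8859-1 never raises, but each multi-byte character becomes several wrong ones
df_latin1 = pd.read_csv(BytesIO(raw), encoding='ISO-8859-1')

print(df_utf8.loc[0, 'Title'])
print(repr(df_latin1.loc[0, 'Title']))
```

If the real file does turn out to be UTF-8, decoding the downloaded bytes with 'utf-8' instead of 'ISO-8859-1' would be the corresponding change in the cell above.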


In [213]:
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

df.drop(to_drop, inplace=True, axis=1)

Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.

In [214]:
all_dataframes[0].head(3)

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of âAll for ...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...


Alternatively, we could also remove the columns by passing them to the columns parameter directly instead of separately specifying the labels to be removed and the axis where pandas should look for the labels:

df.drop(columns=to_drop, inplace=True)
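
As a rough illustration of the two spellings, the sketch below uses a made-up miniature of the student example from the start of this section (the frame `students` and its values are hypothetical). It leaves out inplace=True so that the original frame stays available for comparison.

```python
import pandas as pd

# Hypothetical miniature of the student example from the introduction
students = pd.DataFrame({
    'name': ['Ann', 'Ben'],
    'grade': [91, 78],
    'address': ['1 Elm St', '2 Oak Ave'],
    'parents': ['P. and Q.', 'R. and S.'],
})
unwanted = ['address', 'parents']

# Without inplace=True, drop() returns a new DataFrame and leaves the original untouched
by_axis = students.drop(unwanted, axis=1)
by_columns = students.drop(columns=unwanted)

print(by_axis.equals(by_columns))  # compares the two results
print(list(students.columns))      # columns of the untouched original
```

Note that dropping labels that are no longer present raises a KeyError, so running both forms in place on the same DataFrame would fail on the second call unless errors='ignore' is passed.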

### Changing the Index of a DataFrame

A pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.

For example, in the dataset used in the previous section, it can be expected that when a librarian searches for a record, they may input the unique identifier (values in the Identifier column) for a book:

In [215]:
df['Identifier'].is_unique

True

In [216]:
print('BEFORE setting index')
df.head(3)

BEFORE setting index


Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of âAll for ...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...


In [217]:


#df = df.set_index('Identifier')
df.set_index('Identifier', inplace=True)
print('AFTER setting index')
df.head(3)


AFTER setting index


Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of âAll for ...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...


We can access each record in a straightforward way with `loc[]`. Although the name `loc[]` is not especially intuitive, it performs label-based indexing: it selects a row or record by its index label, regardless of its position:

In [218]:
df.loc[206]

Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object

In other words, 206 is the first label of the index. To access the same row by position, we could use df.iloc[0], which does position-based indexing. The next cell applies the same idea to the third row, df.iloc[2], whose label is 218:

In [219]:
df.iloc[2]

Place of Publication                                               London
Date of Publication                                                  1869
Publisher                                           Bradbury, Evans & Co.
Title                   Love the Avenger. By the author of âAll for ...
Author                                                          A., A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 218, dtype: object

Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-in range. By passing a column name to set_index, we have changed the index to the values in Identifier.
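
The difference between the two index types and the two accessors can be seen on a tiny frame. This is a minimal sketch, assuming a hypothetical three-row frame `books` that reuses identifiers from the output above.

```python
import pandas as pd

# Hypothetical three-row frame reusing identifiers from the output above
books = pd.DataFrame({
    'Identifier': [206, 216, 218],
    'Place of Publication': ['London', 'London; Virtue & Yorston', 'London'],
})
print(type(books.index).__name__)  # index type before set_index

books = books.set_index('Identifier')

by_label = books.loc[218]    # label-based: the row whose index value is 218
by_position = books.iloc[2]  # position-based: the third row
print(by_label.equals(by_position))
```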

## Tidying up Fields in the Data

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:

In [220]:
df.dtypes.value_counts()

object    6
Name: count, dtype: int64

In [221]:
df.loc[1905:, 'Date of Publication'].head(10)

Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

- Remove the extra dates in square brackets, wherever present: 1879 [1878]
- Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
- Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
- Convert the string nan to NumPy’s NaN value

Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year:

regex = r'^(\d{4})'

The regular expression above is meant to find any four digits at the beginning of a string, which suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.

The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
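
Before running the regex on the full column, it may help to try it on the handful of patterns listed above. The sketch below is only illustrative: the Series `dates` is assembled by hand from those example values.

```python
import pandas as pd

# Sample values taken from the patterns listed above
dates = pd.Series(['1879 [1878]', '1860-63', '1839, 38-54', '[1897?]', 'nan'])

# The capturing group returns the four leading digits;
# strings that do not start with four digits yield a missing value
years = dates.str.extract(r'^(\d{4})', expand=False)

print(years)
```

The first three strings keep their leading year, while the uncertain date and the literal string nan both come back as missing values.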

Let’s see what happens when we run this regex across our dataset:

In [222]:
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
extr.head()

Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object

Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:

In [223]:
df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype

dtype('float64')

In [224]:
df['Date of Publication'].isnull().sum() / len(df)

0.11717147339205986

In [225]:
df.loc[2836:, 'Date of Publication'].head(10)

Identifier
2836    1897.0
2854    1865.0
2956    1860.0
2957    1873.0
3017    1866.0
3131    1899.0
4598    1814.0
4884    1820.0
4976    1800.0
5382    1847.0
Name: Date of Publication, dtype: float64

## Combining str Methods with NumPy to Clean Columns

Above, you may have noticed the use of df['Date of Publication'].str. This attribute is a way to access speedy string operations in pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().

To clean the Place of Publication field, we can combine pandas str methods with NumPy’s np.where function, which is basically a vectorized form of Excel’s IF() function. It has the following syntax:

np.where(condition, then, else)

Here, condition is either an array-like object or a Boolean mask. then is the value to be used if condition evaluates to True, and else is the value to be used otherwise.

Essentially, np.where() takes each element in the object used for condition, checks whether that particular element evaluates to True in the context of the condition, and returns an ndarray containing then or else, depending on which applies.

It can be nested into a compound if-then statement, allowing us to compute values based on multiple conditions.
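
A nested call might look like the following sketch. The array `scores` and the three labels are hypothetical and serve only to show how the conditions chain together.

```python
import numpy as np

scores = np.array([95, 72, 40, 88])  # hypothetical values, for illustration only

# Read from the outside in: the first condition that holds decides the label,
# and the innermost "else" is the fallback when no condition holds
labels = np.where(scores >= 90, 'high',
                  np.where(scores >= 60, 'medium', 'low'))

print(labels)
```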

We’ll be making use of these two functions to clean Place of Publication since this column has string objects. Here are the contents of the column:

In [226]:
df['Place of Publication'].head(10)

Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford, 1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object

We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case for only some rows that have their place of publication as ‘London’ or ‘Oxford’.

Let’s take a look at two specific entries:

In [227]:
df.loc[4157862]


Place of Publication                                  Newcastle-upon-Tyne
Date of Publication                                                1867.0
Publisher                                                      T. Fordyce
Title                   Local Records; or, Historical Register of rema...
Author                      FORDYCE, T. - Printer, of Newcastle-upon-Tyne
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

In [228]:
df.loc[4159587]


Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                1834.0
Publisher                                                Mackenzie & Dent
Title                   An historical, topographical and descriptive v...
Author                                              Mackenzie, E. (Eneas)
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object

These two books were published in the same place, but one has hyphens in the name of the place while the other does not.

To clean this column in one sweep, we can use str.contains() to get a Boolean mask.
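
To see what such a mask looks like, here is a small sketch on a hand-built Series. The Series `places` is hypothetical, although its values are copied from the column contents shown above.

```python
import pandas as pd

# Hypothetical sample built from place names that appear in the column above
places = pd.Series(['London', 'London; Virtue & Yorston',
                    'pp. 40. G. Bryan & Co: Oxford, 1898', 'Newcastle-upon-Tyne'])

# True wherever the substring occurs anywhere in the value
is_london = places.str.contains('London')
print(is_london.tolist())

# A Boolean mask can also select the matching rows directly
print(places[is_london].tolist())
```

By default str.contains() treats its argument as a regular expression; passing regex=False matches it literally, and na=False makes missing values count as non-matches.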

We clean the column as follows:

In [229]:
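# Build the Boolean masks first so the nested np.where below can use them
pub = df['Place of Publication']
london = pub.str.contains('London')
oxford = pub.str.contains('Oxford')

# London and Oxford rows get a clean label; every other row only has hyphens replaced by spaces
df['Place of Publication'] = np.where(london, 'London',
                                      np.where(oxford, 'Oxford',
                                               pub.str.replace('-', ' ')))

# Second pass on the cleaned column: collapse every Newcastle variant into one label
pub = df['Place of Publication']
newcastle = pub.str.contains('Newcastle')
df['Place of Publication'] = np.where(newcastle, 'Newcastle', pub.str.replace('-', ' '))

df.loc[4157862]
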
Place of Publication                                            Newcastle
Date of Publication                                                1867.0
Publisher                                                      T. Fordyce
Title                   Local Records; or, Historical Register of rema...
Author                      FORDYCE, T. - Printer, of Newcastle-upon-Tyne
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

Here, the np.where function is called in a nested structure, with condition being a Series of Booleans obtained with str.contains(). The contains() method works similarly to the built-in in keyword used to find the occurrence of an entity in an iterable (or substring in a string).

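The replacement to be used is a string representing our desired place of publication. We also replace hyphens with a space with str.replace() and reassign to the column in our DataFrame. A second np.where pass then collapses every value containing Newcastle into the single label Newcastle, which is why the record above no longer reads Newcastle-upon-Tyne.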

Although there is more dirty data in this dataset, we will discuss only these two columns for now.

Let’s have a look at the first five entries, which look a lot crisper than when we started out:

In [230]:
df.head()


Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879.0,S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London,1868.0,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869.0,"Bradbury, Evans & Co.",Love the Avenger. By the author of âAll for ...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851.0,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857.0,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


## Cleaning the Entire Dataset Using the applymap Function

In certain situations, you will see that the “dirt” is not localized to one column but is more spread out.

There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. The pandas .applymap() method is similar to the built-in map() function and simply applies a function to every element in a DataFrame. (Since pandas 2.1, .applymap() is deprecated in favor of the equivalent DataFrame.map().)
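
As a minimal sketch of the idea, the example below strips stray whitespace from every cell of a small frame. The frame `messy` and the helper `strip_text` are hypothetical, and the fallback line is only there so that the snippet runs on pandas versions from before and after the rename.

```python
import pandas as pd

# Hypothetical frame with stray whitespace spread across several columns
messy = pd.DataFrame({
    'State': [' Alabama ', 'Alaska  '],
    'RegionName': ['Auburn ', '  Fairbanks'],
})

def strip_text(cell):
    # Only strings have .strip(); leave every other type untouched
    return cell.strip() if isinstance(cell, str) else cell

# DataFrame.map is the newer name (pandas 2.1+); fall back to applymap on older versions
elementwise = messy.map if hasattr(pd.DataFrame, 'map') else messy.applymap
clean = elementwise(strip_text)

print(clean)
```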

Let’s look at an example. We will create a DataFrame out of the “university_towns.txt” file:

In [231]:

AllJSON = []

my_prefix = 'data-set-0/'
my_file = 'university_towns.txt'
full_file = my_prefix + my_file

file_names = list(bucket.list_blobs(prefix=my_prefix))
for file in file_names:
    if(file.name != my_prefix):
        if file.name == full_file:
            AllJSON.append(file.name)
            
AllJSON


['data-set-0/university_towns.txt']

In [240]:
university_towns = []

file_name = 'data-set-0/university_towns.txt'
blob = bucket.get_blob(file_name)
if blob is not None and blob.exists(storage_client):
    data = blob.download_as_string()
    data1 = data.decode("utf-8")
    # NOTE: read_csv interprets a plain str as a file path, not as file contents,
    # so this call raises the OSError shown below. It expects a path or a
    # file-like object (for example StringIO(data1)).
    df_town = pd.read_csv(data1, sep=r"\s", engine='python')

df_town
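
pd.read_csv treats a plain str argument as a file path, which is why the cell above ends in the OSError that follows. One possible way around it, sketched here on an abridged excerpt of the file rather than the real download, is to walk the lines and pair each town with the most recent state header. The names `sample`, `rows`, and `towns`, and the column labels State and RegionName, are assumptions made for this illustration.

```python
import pandas as pd

# Abridged excerpt of the file contents, standing in for data1
sample = (
    'Alabama[edit]\n'
    'Auburn (Auburn University)[1]\n'
    'Florence (University of North Alabama)\n'
    'Alaska[edit]\n'
    'Fairbanks (University of Alaska Fairbanks)[2]\n'
)

rows = []
state = None
for line in sample.splitlines():
    if '[edit]' in line:
        state = line                # state headers carry the '[edit]' marker
    else:
        rows.append((state, line))  # every other line is a town in that state

towns = pd.DataFrame(rows, columns=['State', 'RegionName'])
print(towns)
```

The '[edit]' markers and bracketed footnote numbers are still present in the cells at this point, so they remain to be cleaned afterwards.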

OSError: [Errno 36] File name too long: 'Alabama[edit]\nAuburn (Auburn University)[1]\nFlorence (University of North Alabama)\nJacksonville (Jacksonville State University)[2]\nLivingston (University of West Alabama)[2]\nMontevallo (University of Montevallo)[2]\nTroy (Troy University)[2]\nTuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]\nTuskegee (Tuskegee University)[5]\nAlaska[edit]\nFairbanks (University of Alaska Fairbanks)[2]\nArizona[edit]\nFlagstaff (Northern Arizona University)[6]\nTempe (Arizona State University)\nTucson (University of Arizona)\nArkansas[edit]\nArkadelphia (Henderson State University, Ouachita Baptist University)[2]\nConway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]\nFayetteville (University of Arkansas)[7]\nJonesboro (Arkansas State University)[8]\nMagnolia (Southern Arkansas University)[2]\nMonticello (University of Arkansas at Monticello)[2]\nRussellville (Arkansas Tech University)[2]\nSearcy (Harding University)[5]\nCalifornia[edit]\nAngwin (Pacific Union College)[2]\nArcata (Humboldt State University)[5]\nBerkeley (University of California, Berkeley)[5]\nChico (California State University, Chico)[2]\nClaremont (Claremont McKenna College, Pomona College, Harvey Mudd College, Scripps College, Pitzer College, Keck Graduate Institute, Claremont Graduate University)[5]\nCotati (California State University, Sonoma)[2]\nDavis (University of California, Davis)[1]\nIrvine (University of California, Irvine)\nIsla Vista (University of California, Santa Barbara)[2]\nUniversity Park, Los Angeles (University of Southern California)\nMerced (University of California, Merced)\nOrange (Chapman University)\nPalo Alto (Stanford University)\nPomona (Cal Poly Pomona, WesternU)[9][10][11] and formerly Pomona College\nRedlands (University of Redlands)\nRiverside (University of California, Riverside, California Baptist University, La Sierra University)\nSacramento (California State University, 
Sacramento)\nUniversity District, San Bernardino (California State University, San Bernardino, American Sports University)\nSan Diego (University of California, San Diego, San Diego State University)\nSan Luis Obispo (California Polytechnic State University)[2]\nSanta Barbara (Fielding Graduate University, Santa Barbara City College, University of California, Santa Barbara, Westmont College)[2]\nSanta Cruz (University of California, Santa Cruz)[2]\nTurlock (California State University, Stanislaus)\nWestwood, Los Angeles (University of California, Los Angeles)[2]\nWhittier (Whittier CollegeRio Hondo College)\nColorado[edit]\nAlamosa (Adams State College)[2]\nBoulder (University of Colorado at Boulder)[12]\nDurango (Fort Lewis College)[2]\nFort Collins (Colorado State University)[13]\nGolden (Colorado School of Mines)\nGrand Junction (Colorado Mesa University)\nGreeley (University of Northern Colorado)\nGunnison (Western State College)[2]\nPueblo, Colorado (Colorado State University-Pueblo)\nConnecticut[edit]\nFairfield (Fairfield University, Sacred Heart University)\nMiddletown (Wesleyan University)\nNew Britain (Central Connecticut State University)\nNew Haven (Yale University, University of New Haven, Southern Connecticut State University, Albertus Magnus College, Quinnipiac University)[14]\nNew London (Connecticut College, US Coast Guard Academy, Mitchell College)[2]\nStorrs (University of Connecticut)[2]\nWillimantic (Eastern Connecticut State University)[2]\nDelaware[edit]\nDover (Delaware State University)[1]\nNewark (University of Delaware)[1]\nFlorida[edit]\nAve Maria (Ave Maria University)\nBoca Raton (Florida Atlantic University)\nCoral Gables (University of Miami)\nDeLand (Stetson University)[5]\nEstero (Florida Gulf Coast University)\nGainesville (University of Florida, Santa Fe College)\nOrlando (University of Central Florida)\nSarasota (New College of Florida, Ringling College of Art and Design, State College of Florida, Manatee-Sarasota, University of 
South Florida Sarasota-Manatee)\nSt. Augustine (Flagler College)\nSt. Leo (St. Leo University)\nTallahassee (Florida State University, Florida A&M University)\nTampa (University of South Florida)\nGeorgia[edit]\nAlbany (Albany State University)\nAthens (University of Georgia)[15]\nAtlanta (Georgia State University, Georgia Tech, Emory)[2]\nCarrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]\nDemorest (Piedmont College)[2]\nFort Valley (Fort Valley State University)[2]\nKennesaw (Kennesaw State University)\nMilledgeville (Georgia College & State University)[2]\nMount Vernon (Brewton-Parker College)[2]\nOxford (Oxford College)\nRome (Berry College, Shorter University)\nSavannah (Armstrong Atlantic State University, Savannah State University, Savannah College of Art and Design)\nStatesboro (Georgia Southern University)[5]\nValdosta (Valdosta State University)[2]\nWaleska (Reinhardt College)[2]\nYoung Harris (Young Harris College)[2]\nHawaii[edit]\nManoa (University of Hawaii at Manoa)[2]\nIdaho[edit]\nMoscow (University of Idaho)[2]\nPocatello (Idaho State University)[2]\nRexburg (BYU-Idaho)[2]\nIllinois[edit]\nCarbondale (Southern Illinois University Carbondale)[5]\nChampaign–Urbana (University of Illinois at Urbana–Champaign)[5]\nCharleston (Eastern Illinois University)[2]\nDeKalb (Northern Illinois University)[2]\nEdwardsville (Southern Illinois University Edwardsville)[2]\nEvanston (Northwestern University)[2]\nLebanon (McKendree University)[2]\nMacomb (Western Illinois University)[2]\nNormal (Illinois State University)[2]\nPeoria (Bradley University)\nIndiana[edit]\nBloomington (Indiana University Bloomington)[5]\nCrawfordsville (Wabash College)\nGreencastle (DePauw University)[5]\nHanover (Hanover College)[2]\nMarion (Indiana Wesleyan University)[2]\nMuncie (Ball State University)[2]\nOakland City (Oakland City University)[2]\nRichmond (Earlham College)[2]\nSouth Bend (Notre Dame University[2])\nTerre Haute (Indiana 
State University, Rose-Hulman Institute of Technology)[2]\nUpland (Taylor University)[2]\nValparaiso (Valparaiso University)\nWest Lafayette (Purdue University)[2]\nIowa[edit]\nAmes (Iowa State University)[2]\nCedar Falls (University of Northern Iowa)[2]\nCedar Rapids, Iowa (Coe College )\nDecorah (Luther College)[5]\nFayette (Upper Iowa University)[2]\nGrinnell (Grinnell College)[15]\nIowa City (University of Iowa)[15]\nLamoni (Graceland University)[2]\nMount Vernon, (Cornell College)\nOrange City (Northwestern College)[2]\nSioux Center (Dordt College)[2]\nStorm Lake (Buena Vista University)[2]\nWaverly (Wartburg College)[2]\nKansas[edit]\nBaldwin City (Baker University)[5]\nEmporia (Emporia State University)[2]\nHays (Fort Hays State University)[2]\nLawrence (University of Kansas, Haskell Indian Nations University)[15]\nManhattan (Kansas State University, Manhattan Christian College)[15]\nPittsburg (Pittsburg State University)[2]\nKentucky[edit]\nBowling Green (Western Kentucky University)[2]\nColumbia (Lindsey Wilson College)[2]\nGeorgetown (Georgetown College)\nHighland Heights (Northern Kentucky University)\nLexington (University of Kentucky, Transylvania University[5]\nLouisville (University of Louisville)\nMorehead (Morehead State University)[2]\nMurray (Murray State University)[5]\nRichmond (Eastern Kentucky University)[2]\nWilliamsburg (University of the Cumberlands)[2]\nWilmore (Asbury University, Asbury Theological Seminary)[2]\nLouisiana[edit]\nBaton Rouge (Louisiana State University, Southern University)\nGrambling (Grambling State University)[5]\nHammond (Southeastern Louisiana University)[2]\nLafayette (University of Louisiana at Lafayette)\nMonroe (University of Louisiana at Monroe)[2]\nNatchitoches (Northwestern State University)[2]\nRuston (Louisiana Tech University)[2]\nThibodaux (Nicholls State University)[2]\nMaine[edit]\nAugusta (University of Maine at Augusta)[2]\nBar Harbor (College of the Atlantic)\nBrunswick (Bowdoin College)\nFarmington 
(University of Maine at Farmington)[2]\nFort Kent (University of Maine at Fort Kent)\nGorham (University of Southern Maine)[2]\nLewiston, Maine (Bates College)\nOrono (University of Maine)[2]\nWaterville (Thomas College, Colby College)\nMaryland[edit]\nAnnapolis (United States Naval Academy, St. John\'s College)\nChestertown (Washington College)[2]\nCollege Park (University of Maryland, College Park)[16]\nCumberland (Allegany College of Maryland)\nEmmitsburg (Mount St. Mary\'s University)[2]\nFrostburg (Frostburg State University)[5]\nPrincess Anne (University of Maryland Eastern Shore)[5]\nTowson (Towson University, Goucher College)[2]\nSalisbury (Salisbury University)[2]\nWestminster (McDaniel College)\nMassachusetts[edit]\nBoston (Boston University, Boston College, Boston Conservatory, New England Conservatory, Brandeis University, Northeastern University, UMass Boston, Emmanuel College, Bunker Hill Community College, Roxbury Community College, Suffolk University, Simmons College, among many others)\nBridgewater (Bridgewater State College)[2]\nCambridge (Harvard University, Massachusetts Institute of Technology)(Lesley University, Cambridge College, Longy School of Music)[15]\nChestnut Hill (Boston College)\nThe Colleges of Worcester Consortium:\nDudley (Nichols College)\nNorth Grafton (Cummings School of Veterinary Medicine at Tufts University)\nPaxton (Anna Maria College)\nWorcester (Assumption, Becker, Clark University, Holy Cross, Mass. 
College of Pharmacy & Health Sciences, Quinsigamond Community College, UMass Medical School, Worcester State University, Worcester Polytechnic Institute)\nThe Five College Region of Western Massachusetts:\nAmherst (Amherst College, Hampshire College, University of Massachusetts Amherst)[15]\nNorthampton (Smith College)\nSouth Hadley (Mount Holyoke College)\nFitchburg (Fitchburg State College)\nNorth Adams (Massachusetts College of Liberal Arts)\nSpringfield (American International College), (Springfield College), and (Western New England College)\nWaltham (Bentley University), (Brandeis University)\nWilliamstown (Williams College)\nFramingham (Framingham State University)\nMichigan[edit]\nAdrian (Adrian College, Siena Heights University)\nAlbion (Albion College)[17]\nAllendale (Grand Valley State University)\nAlma (Alma College)\nAnn Arbor (University of Michigan)[1]\nBerrien Springs (Andrews University)[2]\nBig Rapids (Ferris State University)[2]\nEast Lansing (Michigan State University)[2]\nFlint (Kettering University, University of Michigan-Flint)\nHillsdale (Hillsdale College)\nHoughton (Michigan Technological University)[5]\nKalamazoo (Western Michigan University, Kalamazoo College)[2]\nMarquette (Northern Michigan University)[2]\nMidland (Northwood University)\nMount Pleasant (Central Michigan University)[2]\nOlivet (Olivet College)[2]\nSaginaw (Saginaw Valley State University)\nSault Ste. Marie (Lake Superior State University)\nSpring Arbor (Spring Arbor University)[2]\nYpsilanti (Eastern Michigan University)[2]\nMinnesota[edit]\nBemidji (Bemidji State University)[2]\nCrookston (University of Minnesota Crookston)[2]\nDuluth (University of Minnesota Duluth, Lake Superior College, The College of St. 
Scholastica, University of Wisconsin–Superior, Duluth Business University\nFaribault, South Central College\nMankato (Minnesota State University, Mankato),[2] Bethany Lutheran College\nMarshall (Southwest Minnesota State University)[2]\nMoorhead (Minnesota State University, Moorhead, Concordia College)[18]\nMorris (University of Minnesota Morris)[2]\nNorthfield (Carleton College, St. Olaf College)[5]\nNorth Mankato, South Central College\nSt. Cloud (St. Cloud State University, The College of St. Scholastica)[2]\nSt. Joseph (College of Saint Benedict)[2]\nSt. Peter (Gustavus Adolphus College)[2]\nWinona (Winona State University, St. Mary\'s University of Minnesota)[19]\nMississippi[edit]\nCleveland (Delta State University)[2]\nHattiesburg (University of Southern Mississippi)[20]\nItta Bena (Mississippi Valley State University)[2]\nOxford (University of Mississippi)[2]\nStarkville (Mississippi State University)[2]\nMissouri[edit]\nBolivar (Southwest Baptist University)[2]\nCape Girardeau (Southeast Missouri State University)[2]\nColumbia (University of Missouri, Stephens College, Columbia College)[20]\nFayette (Central Methodist University)[2]\nFulton (Westminster College and William Woods University).\nKirksville (Truman State University, A. T. 
Still University)[2]\nMaryville (Northwest Missouri State University)[2]\nRolla (Missouri University of Science and Technology)[2]\nWarrensburg (University of Central Missouri)[5]\nMontana[edit]\nBozeman (Montana State University)[2]\nDillon (University of Montana Western)[2]\nMissoula (University of Montana)[5]\nNebraska[edit]\nChadron (Chadron State College)[5]\nCrete (Doane College)[2]\nKearney (University of Nebraska at Kearney)[2]\nLincoln (University of Nebraska at Lincoln)[5]\nPeru (Peru State College)[2]\nSeward (Concordia University)[2]\nWayne (Wayne State College)[2]\nNevada[edit]\nLas Vegas (University of Nevada, Las Vegas)\nReno (University of Nevada, Reno)\nNew Hampshire[edit]\nNew London, New Hampshire (Colby-Sawyer College)\nDurham (University of New Hampshire)[2]\nHanover (Dartmouth College)[5]\nHenniker (New England College)\nKeene (Keene State College)[2]\nPlymouth (Plymouth State University)[2]\nRindge (Franklin Pierce University)\nNew Jersey[edit]\nEwing (The College of New Jersey), (Rider University)\nJersey City (New Jersey City University), (Saint Peter\'s University)\nGlassboro (Rowan University)[2]\nHoboken (Stevens Institute of Technology)\nMadison (Drew University), (Fairleigh Dickinson University), (College of Saint Elizabeth)\nNewark (Rutgers University), (New Jersey Institute of Technology), (UMDNJ)\nNew Brunswick (Rutgers University)[5]\nPrinceton (Princeton University)[5]\nUnion (Kean University)\nWest Long Branch (Monmouth University)\nNew Mexico[edit]\nHobbs (University of the Southwest)[2]\nLas Cruces (New Mexico State University)[2]\nLas Vegas (New Mexico Highlands University)[2]\nPortales (Eastern New Mexico University)[2]\nSilver City (Western New Mexico University)[2]\nNew York[edit]\nAlfred (Alfred University, Alfred State College)[2]\nAlbany (SUNY Albany, Siena College, Albany College of Pharmacy, Albany Law School, Albany Medical College, College of Saint Rose, Excelsior College, Maria College of Albany, Mildred Elley, Sage 
College of Albany)\nAurora (Wells College)[21]\nBinghamton (Binghamton University)[2]\nBrockport (SUNY Brockport)[5]\nBuffalo (University at Buffalo)\nCanton (St. Lawrence University, SUNY Canton)[2]\nClinton (Hamilton College)[2]\nCobleskill (SUNY Cobleskill)[2]\nDelhi (SUNY Delhi)[2]\nFredonia (SUNY Fredonia)[2]\nGeneseo (SUNY Geneseo)[2]\nGeneva (Hobart and William Smith Colleges)\nHamilton (Colgate University)[2]\nIthaca (Cornell University, Ithaca College)[1]\nMorningside Heights, Manhattan (Columbia University, Barnard College, Teachers College, Manhattan School of Music, Jewish Theological Seminary, Union Theological Seminary, Bank Street College of Education)\nNew Paltz (SUNY New Paltz)[2]\nOneonta (SUNY Oneonta, Hartwick College)[2]\nOswego (SUNY Oswego)[2]\nPlattsburgh (SUNY Plattsburgh)[2]\nPotsdam (SUNY Potsdam, Clarkson University)[2]\nPoughkeepsie (Vassar College, Marist College)[2]\nPurchase (Purchase College, Manhattanville College)[2]\nRochester (University of Rochester, Rochester Institute of Technology, Nazareth College, St. 
John Fisher College, Monroe Community College, Roberts Wesleyan College, SUNY Brockport, SUNY Empire State College)[2]\nSaratoga Springs (Skidmore College)[2]\nSeneca Falls (New York Chiropractic College)\nStony Brook (Stony Brook University)\nSyracuse (Syracuse University, SUNY ESF, Upstate Medical University)\nTivoli (Bard College)\nTroy (Rensselaer Polytechnic Institute, Russell Sage College, Hudson Valley Community College)\nWest Point (United States Military Academy)\nNorth Carolina[edit]\nBanner Elk (Lees-McRae College)\nBoiling Springs (Gardner-Webb University)[2]\nBoone (Appalachian State University)[2]\nBuies Creek (Campbell University)[2]\nChapel Hill (University of North Carolina at Chapel Hill)[20]\nCullowhee (Western Carolina University)[2]\nDavidson (Davidson College)[5]\nDurham (Duke University, North Carolina Central University)[5]\nElon (Elon University)[2]\nGreensboro (University of North Carolina at Greensboro, Greensboro College, Guilford College, North Carolina A & T State University, Bennett College)\nGreenville (East Carolina University)[2]\nHickory (Lenoir-Rhyne University)[2]\nMars Hill (Mars Hill College)[2]\nMount Olive (Mount Olive College)[2]\nPembroke (University of North Carolina at Pembroke)[2]\nWilmington, North Carolina (University of North Carolina at Wilmington)\nWingate (Wingate University)[2]\nWinston-Salem (Wake Forest University, University of North Carolina School of the Arts, Salem College, Winston-Salem State University)\nNorth Dakota[edit]\nFargo (North Dakota State University)[18]\nGrand Forks (University of North Dakota)[5]\nOhio[edit]\nAda (Ohio Northern University)[2]\nAlliance (University of Mount Union)\nAshland (Ashland University)[2]\nAthens (Ohio University)[2]\nBerea (Baldwin Wallace College)\nBluffton (Bluffton University)[2]\nBowling Green (Bowling Green State University)[2]\nCedarville (Cedarville University)[2]\nColumbus (Ohio State University)\nDelaware (Ohio Wesleyan University)\nFairborn (Wright State 
University)\nFindlay (University of Findlay)\nGambier (Kenyon College)[2]\nGranville (Denison University)[2]\nHiram (Hiram College)[2]\nKent (Kent State University)[2]\nNelsonville (Hocking College)[2]\nNew Concord (Muskingum College)[2]\nOberlin (Oberlin College)[5]\nOxford (Miami University)[5]\nRio Grande (University of Rio Grande)[2]\nWilberforce (Wilberforce University, Central State University)[2]\nOklahoma[edit]\nAda (East Central University)[2]\nAlva (Northwestern Oklahoma State University)[2]\nDurant (Southeastern Oklahoma State University)[2]\nEdmond (University of Central Oklahoma, Oklahoma Christian University)[2]\nGoodwell (Oklahoma Panhandle State University)[2]\nLangston (Langston University)[5]\nNorman (University of Oklahoma)[1]\nStillwater (Oklahoma State University)[5]\nTahlequah (Northeastern State University)[2]\nTulsa (The University of Tulsa)\nWeatherford (Southwestern Oklahoma State University)\nOregon[edit]\nAshland (Southern Oregon University)[2]\nCorvallis (Oregon State University)[20]\nEugene (Lane Community College, Northwest Christian University, University of Oregon)[20]\nForest Grove (Pacific University)\nKlamath Falls (Klamath Community College, Oregon Institute of Technology)\nLa Grande (Eastern Oregon University)[2]\nMarylhurst (Marylhurst University)\nMcMinnville (Linfield College)\nMonmouth (Western Oregon University)[2]\nNewberg (George Fox University)\nPennsylvania[edit]\nAltoona (Penn State Altoona)\nAnnville (Lebanon Valley College)[2]\nBethlehem (Lehigh University, Moravian College)\nBloomsburg (Bloomsburg University of Pennsylvania)[2]\nBradford (University of Pittsburgh at Bradford)\nCalifornia (California University of Pennsylvania)[2]\nCarlisle (Dickinson College)\nCecil B. 
Moore, Philadelphia, also known as "Templetown" (Temple University)\nClarion (Clarion University of Pennsylvania)[2]\nCollegeville (Ursinus College)\nCresson (Mount Aloysius College)[2]\nEast Stroudsburg (East Stroudsburg University of Pennsylvania)[2]\nEdinboro (Edinboro University of Pennsylvania)[2]\nErie (Gannon University, Mercyhurst College, Penn State Erie)\nGettysburg (Gettysburg College)[2]\nGreensburg (Seton Hill University, University of Pittsburgh at Greensburg)\nGrove City (Grove City College)[2]\nHuntingdon (Juniata College)[2]\nIndiana (Indiana University of Pennsylvania)[2]\nJohnstown (University of Pittsburgh at Johnstown)\nKutztown (Kutztown University of Pennsylvania)[2]\nLancaster (Franklin & Marshall)\nLewisburg (Bucknell University)[5]\nLock Haven (Lock Haven University of Pennsylvania)[2]\nLoretto (St. Francis University)[2]\nMansfield (Mansfield University of Pennsylvania)[2]\nMeadville (Allegheny College)\nMont Alto (Penn State Mont Alto)\nMillersville (Millersville University of Pennsylvania)[2]\nNew Wilmington (Westminster College)[2]\nNorth East (Mercyhurst North East)\nUniversity City, Philadelphia (Drexel University, University of Pennsylvania, University of the Sciences in Philadelphia)\nOakland, Pittsburgh (Carnegie Mellon University, University of Pittsburgh, Carlow University)\nReading (Albright College, Alvernia University, Penn State Berks)\nSelinsgrove (Susquehanna University)[2]\nShippensburg (Shippensburg University of Pennsylvania)[2]\nSlippery Rock (Slippery Rock University of Pennsylvania)[2]\nState College (Pennsylvania State University)[22]\nVillanova (Villanova University)\nWaynesburg (Waynesburg University)\nWest Chester (West Chester University of Pennsylvania)\nWilkes-Barre (King\'s College, Wilkes University)\nWilliamsport (Lycoming College, Pennsylvania College of Technology)[2]\nRhode Island[edit]\nKingston (University of Rhode Island)[2]\nProvidence (Brown University, (University of Rhode Island), Rhode Island 
School of Design, Johnson and Wales University, Providence College, Community College of Rhode Island, Rhode Island College, and Roger Williams University.)\nSouth Carolina[edit]\nCentral (Southern Wesleyan University)[2]\nCharleston (College of Charleston, The Citadel, MUSC)\nClemson (Clemson University)[2]\nClinton (Presbyterian College)\nColumbia (University of South Carolina)[14]\nDue West (Erskine College)\nFlorence (Francis Marion University)\nGreenwood (Lander University)\nOrangeburg (South Carolina State University, Claflin University)[2]\nRock Hill (Winthrop University)\nSpartanburg (Wofford College, Converse College, University of South Carolina Upstate, Spartanburg Methodist College, Edward Via College of Osteopathic Medicine, Spartanburg Community College, Virginia College, Sherman College of Chiropractic)\nSouth Dakota[edit]\nBrookings (South Dakota State University)[2]\nMadison (Dakota State University)\nSpearfish (Black Hills State University)\nVermillion (University of South Dakota)[5]\nTennessee[edit]\nChattanooga (University of Tennessee at Chattanooga)\nCollegedale (Southern Adventist University)\nCookeville (Tennessee Technological University)[2]\nHarrogate (Lincoln Memorial University)[2]\nHenderson (Freed-Hardeman University)[2]\nJohnson City (East Tennessee State University)\nKnoxville (University of Tennessee)\nMartin (University of Tennessee at Martin)[2]\nMcKenzie (Bethel University)[2]\nMemphis (Christian Brothers University, LeMoyne-Owen College, Memphis College of Art, Memphis Theological Seminary, Rhodes College, Southern College of Optometry, Southwest Tennessee Community College, University of Memphis, University of Tennessee Health Science Center, Visible Music College)\nMurfreesboro (Middle Tennessee State University)[2]\nNashville (Vanderbilt University, Belmont University, Tennessee State University, Lipscomb University, Fisk University, Aquinas College, Trevecca Nazarene University)\nSewanee (Sewanee: the University of the 
South)[2]\nTexas[edit]\nAbilene (Abilene Christian University, Hardin-Simmons University, McMurry University)\nAlpine (Sul Ross State University)[2]\nAustin (University of Texas at Austin, St. Edwards University, Huston-Tillotson University)[2]\nBeaumont (Lamar University)\nCanyon (West Texas A&M University)[2]\nCollege Station (Texas A&M University)[5]\nCommerce (Texas A&M University–Commerce)[2]\nDallas (Southern Methodist University)\nDenton (University of North Texas, Texas Woman\'s University)[2]\nFort Worth (Texas Christian University, Texas Wesleyan University)\nGeorgetown (Southwestern University)\nHuntsville (Sam Houston State University)[2]\nHouston (University of Houston, Texas Southern University, Rice University, Houston Baptist University)\nKeene (Southwestern Adventist University)[2]\nKingsville (Texas A&M University–Kingsville)[2]\nLubbock (Texas Tech University, Lubbock Christian University)\nNacogdoches (Stephen F. Austin State University)[2]\nPlainview (Wayland Baptist University)[2]\nPrairie View (Prairie View A&M University)[2]\nSan Marcos (Texas State University)[5]\nStephenville (Tarleton State University)[2]\nWaco (Baylor University)\nUtah[edit]\nCedar City (Southern Utah University)[2]\nLogan (Utah State University)[2]\nProvo (Brigham Young University)[5]\nOrem (Utah Valley University)\nSalt Lake City (University of Utah)\nEphraim (Snow College)\nVermont[edit]\nBurlington (University of Vermont, Champlain College and Saint Michael\'s College)[2]\nCastleton (Castleton State College)[2]\nJohnson (Johnson State College)[2]\nLyndonville (Lyndon State College)[2]\nMiddlebury (Middlebury College)[2]\nNorthfield (Norwich University)[2]\nVirginia[edit]\nBlacksburg (Virginia Polytechnic Institute and State University)[5]\nBridgewater (Bridgewater College)[2]\nCharlottesville (University of Virginia)[23]\nFarmville (Longwood University, Hampden-Sydney College)[2]\nFredericksburg (University of Mary Washington)[2]\nHarrisonburg (James Madison 
University, Eastern Mennonite University)[2]\nLexington (Washington and Lee University, Virginia Military Institute)[2]\nLynchburg (Lynchburg College, Randolph College, Liberty University, Central Virginia Community College)\nRadford (Radford University)[2]\nWilliamsburg (The College of William & Mary)[2]\nWise (University of Virginia\'s College at Wise)[2]\nChesapeake (Averett University, DeVry University, Troy University, Tidewater Community College, Strayer University, Everest University, Sentera College of Health Sciences, St Leo University)[2]\nWashington[edit]\nBellingham (Western Washington University)\nCheney (Eastern Washington University)[2]\nEllensburg (Central Washington University)[5]\nPullman (Washington State University)[5]\nUniversity District, Seattle (University of Washington)[5]\nWest Virginia[edit]\nAthens (Concord University)[2]\nBuckhannon (West Virginia Wesleyan College)[2]\nFairmont (Fairmont State University)[2]\nGlenville (Glenville State College)[2]\nHuntington (Marshall University)[2]\nMontgomery (West Virginia University Institute of Technology)[2]\nMorgantown (West Virginia University)[2]\nShepherdstown (Shepherd University)[2]\nWest Liberty (West Liberty University)[2]\nWisconsin[edit]\nAppleton (Lawrence University)\nEau Claire (University of Wisconsin–Eau Claire)\nGreen Bay (University of Wisconsin-Green Bay)\nLa Crosse (University of Wisconsin–La Crosse, Western Technical College, Viterbo University)[2]\nMadison (University of Wisconsin–Madison)[2]\nMenomonie (University of Wisconsin–Stout)[2]\nMilwaukee (Marquette University, University of Wisconsin–Milwaukee)\nOshkosh (University of Wisconsin–Oshkosh)\nPlatteville (University of Wisconsin–Platteville)[2]\nRiver Falls (University of Wisconsin–River Falls)[2]\nStevens Point (University of Wisconsin–Stevens Point)[2]\nWaukesha (Carroll University)\nWhitewater (University of Wisconsin–Whitewater)[2]\nWyoming[edit]\nLaramie (University of Wyoming)[5]\n'

In [None]:
university_towns = []

# Assumption: university_towns.txt is stored under `my_prefix`, like the other datasets
blob = bucket.get_blob(my_prefix + 'university_towns.txt')

# Read the blob's text from the bucket; the local path 'Datasets/university_towns.txt' does not exist here
with StringIO(blob.download_as_text()) as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

university_towns[:5]
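
The state/town pairing loop above can be tried without any file or bucket access. The following is a minimal sketch, assuming a short in-memory sample built from a few lines of the raw text shown earlier. The helper name `parse_towns` and the variable `sample` are hypothetical, introduced only for illustration.

```python
from io import StringIO

# Hypothetical sample in the same layout as university_towns.txt:
# a state line contains "[edit]", and the lines after it are that state's towns.
sample = (
    "Montana[edit]\n"
    "Bozeman (Montana State University)[2]\n"
    "Dillon (University of Montana Western)[2]\n"
    "Nevada[edit]\n"
    "Reno (University of Nevada, Reno)\n"
)


def parse_towns(lines):
    """Pair each town line with the most recently seen state line."""
    pairs = []
    state = None  # stays None until the first "[edit]" line is seen
    for line in lines:
        if '[edit]' in line:
            # A new state starts here; remember it for the towns that follow
            state = line
        else:
            # A town line; attach the last-seen state
            pairs.append((state, line))
    return pairs


pairs = parse_towns(StringIO(sample))
for state, town in pairs:
    print(repr(state), repr(town))
```

Each tuple still carries the trailing newline and the `[edit]` or `[2]`-style markers, so a later cleaning step is needed before the values are usable as state and town names.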