# Read in a file

Pandas needs data to work with, and the most common way to get it is to read it in from a CSV file.

Pandas handles most common CSV conventions automatically, so you can usually just point it at a CSV file, as in the example below, and it should just work. By default pandas treats the first line as a header, which is how this file is set up.

In [50]:
from collections import namedtuple
import numpy as np
import pandas as pd

# By default pandas treats the first line of the file as the column names; pass `header=None` if the file has no header row
fandango = pd.read_csv("data/fandango_score_comparison.csv")  # pandas infers the column types as best it can
fandango.head(2)

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5


In [5]:
fandango["FILM"][:2].values

array(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)'],
      dtype=object)
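
The `header=None` option mentioned in the comment above is easiest to see on a tiny in-memory file. This is a minimal sketch: the two-line CSV string and the `Births` column name are made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical two-line CSV with no header row (illustrative data only).
raw = "Bob,968\nJessica,155\n"

# Default behaviour: the first line is consumed as the column names.
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every line is treated as data and columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=[...] supplies column names for a file that has none of its own.
named = pd.read_csv(io.StringIO(raw), header=None, names=["Names", "Births"])

print(with_header.shape, no_header.shape)
print(list(named.columns))
```

With the default settings the first data row is lost to the header, which is why `header=None` matters for files without column names.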

# A small dataset
A small dataset built from Python lists, with no column headers to start with.

In [94]:
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel','bob', 'Mary']
births = [968, 155, 77, 578, 973, 100,55]
BabyDataSet = list(zip(names,births))
BabyDataSet

[('Bob', 968),
 ('Jessica', 155),
 ('Mary', 77),
 ('John', 578),
 ('Mel', 973),
 ('bob', 100),
 ('Mary', 55)]

In [143]:
df = pd.DataFrame(data=BabyDataSet)
df

Unnamed: 0,0,1
0,Bob,968
1,Jessica,155
2,Mary,77
3,John,578
4,Mel,973
5,bob,100
6,Mary,55


In [144]:
df.columns=['Names', 'Birth_Rates']
df

Unnamed: 0,Names,Birth_Rates
0,Bob,968
1,Jessica,155
2,Mary,77
3,John,578
4,Mel,973
5,bob,100
6,Mary,55
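
The two steps above (build the frame, then assign `df.columns`) can also be combined by passing `columns=` to the constructor. The sketch below rebuilds the same baby-name data so that it runs on its own; `df_named` and `df_from_dict` are names introduced here for illustration.

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel', 'bob', 'Mary']
births = [968, 155, 77, 578, 973, 100, 55]

# columns= names the columns at construction time,
# so no separate `df.columns = ...` assignment is needed.
df_named = pd.DataFrame(list(zip(names, births)), columns=['Names', 'Birth_Rates'])

# A dict of lists is another common way to get named columns directly.
df_from_dict = pd.DataFrame({'Names': names, 'Birth_Rates': births})

print(df_named.head(3))
print(df_named.equals(df_from_dict))
```

Either form avoids the intermediate frame with the default `0` and `1` column labels.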


In [135]:
df['Names']

0        Bob
1    Jessica
2       Mary
3       John
4        Mel
5        bob
6       Mary
Name: Names, dtype: object

In [98]:
name_list = [str(name).lower() for name in df['Names']]
print(name_list)
name_string = " ".join(name_list)
name_string

['bob', 'jessica', 'mary', 'john', 'mel', 'bob', 'mary']


'bob jessica mary john mel bob mary'

In [99]:
names = name_string.split()
names

['bob', 'jessica', 'mary', 'john', 'mel', 'bob', 'mary']

In [103]:
from collections import Counter
c = Counter(names)
d = c.most_common(3) #makes a list of tuples with the 3 most common names
d

[('mary', 2), ('bob', 2), ('john', 1)]

Extract just the names from the list of (name, count) tuples:

In [88]:
n = [name for name, _ in d]  # keep the name, drop the count
n

['mary', 'bob', 'john']

In [132]:
df

0                       BobBobBobBobBobBobBobBobBobBob
1    JessicaJessicaJessicaJessicaJessicaJessicaJess...
2             MaryMaryMaryMaryMaryMaryMaryMaryMaryMary
3             JohnJohnJohnJohnJohnJohnJohnJohnJohnJohn
4                       MelMelMelMelMelMelMelMelMelMel
5                       bobbobbobbobbobbobbobbobbobbob
6             MaryMaryMaryMaryMaryMaryMaryMaryMaryMary
Name: Names, dtype: object

In [109]:
a = df['Names'].value_counts()
a

Mary       2
John       1
Jessica    1
Mel        1
bob        1
Bob        1
Name: Names, dtype: int64

In [113]:
a[:3]

Mary       2
John       1
Jessica    1
Name: Names, dtype: int64

In [118]:
for name, count in a[:2].items():
    print(name, count)

Mary 2
John 1
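
Note that `value_counts()` above treats `Bob` and `bob` as different names, whereas the earlier `Counter` example lowercased everything first. One way to get the case-insensitive counts without leaving pandas is to lowercase with the `.str` accessor before counting. This is a sketch on a standalone copy of the names column; `counts` and `top_two` are names introduced here.

```python
import pandas as pd

names = pd.Series(['Bob', 'Jessica', 'Mary', 'John', 'Mel', 'bob', 'Mary'], name='Names')

# Lowercase every name, then count occurrences (sorted by count, descending).
counts = names.str.lower().value_counts()

# The two most frequent names; the order between tied names is not guaranteed.
top_two = counts.head(2)
print(top_two)
```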


In [145]:
df['Names'].apply(lambda x: x*2)

0            BobBob
1    JessicaJessica
2          MaryMary
3          JohnJohn
4            MelMel
5            bobbob
6          MaryMary
Name: Names, dtype: object

In [147]:
df

Unnamed: 0,Names,Birth_Rates
0,Bob,968
1,Jessica,155
2,Mary,77
3,John,578
4,Mel,973
5,bob,100
6,Mary,55


In [148]:
df['Birth_Rates'] = df['Birth_Rates'].apply(lambda x: x/10)
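
For simple arithmetic like the division above, `apply` with a lambda is not required: arithmetic operators work on a whole column at once. The sketch below compares the two forms on a standalone copy of the column; `via_apply` and `vectorized` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({'Birth_Rates': [968, 155, 77, 578, 973, 100, 55]})

# Element by element, as in the cell above.
via_apply = df['Birth_Rates'].apply(lambda x: x / 10)

# Vectorized: the division is applied to the whole column in one operation.
vectorized = df['Birth_Rates'] / 10

print(via_apply.equals(vectorized))
```

The vectorized form is usually shorter and faster on large columns, while `apply` remains useful for logic that cannot be written as column arithmetic.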