# Pandas Foundations

## Dissecting the anatomy of a DataFrame

In [2]:
import pandas as pd
import numpy as np

The **read_csv()** function reads a CSV file and converts it into a DataFrame

In [3]:
movie = pd.read_csv('data/movies.csv')
movie.head()

| | Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 |
| 1 | Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.09 | 68 | $19.62 | 2010.0 |
| 2 | You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 |
| 3 | When in Rome | Comedy | Disney | 44 | 0.0 | 15 | $43.04 | 2010.0 |
| 4 | What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 |
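
Since `data/movies.csv` may not be at hand, the same reading step can be tried on an in-memory file. The sketch below is only an illustration: the three-row CSV text and the name `sample` are made up for this example and are not part of the original dataset file.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the layout of data/movies.csv
csv_text = """Film,Genre,Year
Youth in Revolt,Comedy,2010
When in Rome,Comedy,2010
WALL-E,Animation,2008
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))    # the result is a DataFrame
print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # head(n) shows only the first n rows
```

Because no index column is given, pandas generates a default integer index, which is the unlabeled leftmost column in the displayed table.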


## Accessing the main DataFrame components

In [8]:
columns = movie.columns
index = movie.index
data = movie.values

In [16]:
columns

Index(['Film', 'Genre', 'Lead Studio', 'Audience score %', 'Profitability',
       'Rotten Tomatoes %', 'Worldwide Gross', 'Year'],
      dtype='object')

In [17]:
index

RangeIndex(start=0, stop=77, step=1)

In [20]:
data

array([['Zack and Miri Make a Porno', 'Romance', 'The Weinstein Company',
        70, 1.747541667, 64, '$41.94 ', 2008.0],
       ['Youth in Revolt', 'Comedy', 'The Weinstein Company', 52, 1.09,
        68, '$19.62 ', 2010.0],
       ['You Will Meet a Tall Dark Stranger', 'Comedy', 'Independent',
        35, 1.211818182, 43, '$26.66 ', 2010.0],
       ['When in Rome', 'Comedy', 'Disney', 44, 0.0, 15, '$43.04 ',
        2010.0],
       ['What Happens in Vegas', 'Comedy', 'Fox', 72, 6.267647029, 28,
        '$219.37 ', 2008.0],
       ['Water For Elephants', 'Drama', '20th Century Fox', 72,
        3.081421053, 60, '$117.09 ', 2011.0],
       ['WALL-E', 'Animation', 'Disney', 89, 2.8960190669999997, 96,
        '$521.28 ', 2008.0],
       ['Waitress', 'Romance', 'Independent', 67, 11.089741499999999, 89,
        '$22.18 ', 2007.0],
       ['Waiting For Forever', 'Romance', 'Independent', 53, 0.005, 6,
        '$0.03 ', 2011.0],
       ["Valentine's Day", 'Comedy', 'Warner Bros.', 54, 4.1

**type()** gives the type of a variable

In [22]:
type(index)

pandas.core.indexes.range.RangeIndex

In [23]:
type(columns)

pandas.core.indexes.base.Index

In [24]:
type(data)

numpy.ndarray

**issubclass(first, second)** checks whether the first argument is a subclass of the second argument

In [25]:
issubclass(pd.RangeIndex, pd.Index)

True
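
The three components can also be inspected on a tiny hand-built DataFrame. This is a minimal sketch under the assumption that a two-row frame behaves like the movie data; the frame `sample` is invented for illustration, and `to_numpy()` is used as the documented alternative to the `values` attribute.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for the movie dataset
sample = pd.DataFrame(
    {"Film": ["WALL-E", "Waitress"], "Audience score %": [89, 67]}
)

columns = sample.columns   # column labels
index = sample.index       # row labels
data = sample.to_numpy()   # underlying values as a NumPy array

print(type(columns).__name__, type(index).__name__, type(data).__name__)

# Both label containers are Index objects; RangeIndex is a specialised subclass
print(isinstance(columns, pd.Index), isinstance(index, pd.Index))

# Mixing text and numbers forces a generic object array
print(data.dtype)
```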

## Understanding data types

In [26]:
movie = pd.read_csv('data/movies.csv')

The attribute **dtypes** displays each column along with its data type

In [27]:
movie.dtypes

Film                  object
Genre                 object
Lead Studio           object
Audience score %       int64
Profitability        float64
Rotten Tomatoes %      int64
Worldwide Gross       object
Year                 float64
dtype: object

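The method **get_dtype_counts()** returns the number of columns of each data type in the DataFrame. It was deprecated in pandas 0.25 and removed in pandas 1.0; on newer versions, **movie.dtypes.value_counts()** gives the same counts.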

In [28]:
movie.get_dtype_counts()

float64    2
int64      2
object     4
dtype: int64
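
On pandas versions where `get_dtype_counts()` is no longer available, the same tally can be obtained from the `dtypes` Series. The following is a small sketch on an invented four-column frame (`sample` is hypothetical); the exact dtype label shown for text columns may differ between pandas versions.

```python
import pandas as pd

# Hypothetical frame with two text columns, one integer and one float column
sample = pd.DataFrame({
    "Film": ["WALL-E", "Waitress"],
    "Genre": ["Animation", "Romance"],
    "Audience score %": [89, 67],
    "Profitability": [2.896019, 11.089741],
})

# dtypes is itself a Series, so value_counts() tallies columns per data type
dtype_counts = sample.dtypes.value_counts()
print(dtype_counts)
```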

## Series

A **Series** is a single column of data from a DataFrame. It is a **single dimension** of data, composed of just an index and the data.

There are two different syntaxes to select a Series:

- *Index operator*

- *Dot notation*. This notation **should be avoided**, since it does not work when a column name collides with a DataFrame attribute or method, or when it contains spaces or special characters.
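
The pitfalls of dot notation can be illustrated with a short sketch. The frame below is hypothetical, and its `count` column was chosen on purpose because it shares its name with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical frame: "count" collides with a method, "Lead Studio" has a space
sample = pd.DataFrame({
    "Film": ["WALL-E", "Waitress"],
    "Lead Studio": ["Disney", "Independent"],
    "count": [1, 2],
})

by_index = sample["count"]  # index operator: always returns the column
by_dot = sample.count       # dot notation: returns the bound method instead

print(type(by_index).__name__)
print(callable(by_dot))

# A name with a space is only reachable with the index operator;
# writing sample.Lead Studio would be a SyntaxError
print(sample["Lead Studio"].tolist())
```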

### Selecting a single column of data as a Series

Using **index operator**

In [30]:
movie['Film']

0              Zack and Miri Make a Porno
1                         Youth in Revolt
2      You Will Meet a Tall Dark Stranger
3                            When in Rome
4                   What Happens in Vegas
5                     Water For Elephants
6                                  WALL-E
7                                Waitress
8                     Waiting For Forever
9                         Valentine's Day
10    Tyler Perry's Why Did I get Married
11                Twilight: Breaking Dawn
12                               Twilight
13                         The Ugly Truth
14            The Twilight Saga: New Moon
15               The Time Traveler's Wife
16                           The Proposal
17                 The Invention of Lying
18                     The Heartbreak Kid
19                            The Duchess
20    The Curious Case of Benjamin Button
21                       The Back-up Plan
22                                Tangled
23                     Something B

Using **dot notation**

In [31]:
movie.Film

0              Zack and Miri Make a Porno
1                         Youth in Revolt
2      You Will Meet a Tall Dark Stranger
3                            When in Rome
4                   What Happens in Vegas
5                     Water For Elephants
6                                  WALL-E
7                                Waitress
8                     Waiting For Forever
9                         Valentine's Day
10    Tyler Perry's Why Did I get Married
11                Twilight: Breaking Dawn
12                               Twilight
13                         The Ugly Truth
14            The Twilight Saga: New Moon
15               The Time Traveler's Wife
16                           The Proposal
17                 The Invention of Lying
18                     The Heartbreak Kid
19                            The Duchess
20    The Curious Case of Benjamin Button
21                       The Back-up Plan
22                                Tangled
23                     Something B

In [32]:
type(movie['Film'])

pandas.core.series.Series

In [33]:
type(movie.Film)

pandas.core.series.Series

A Series can be converted into a one-column DataFrame using the method **to_frame()**

In [34]:
film = movie['Film']
film.to_frame()

| | Film |
|---|---|
| 0 | Zack and Miri Make a Porno |
| 1 | Youth in Revolt |
| 2 | You Will Meet a Tall Dark Stranger |
| 3 | When in Rome |
| 4 | What Happens in Vegas |
| 5 | Water For Elephants |
| 6 | WALL-E |
| 7 | Waitress |
| 8 | Waiting For Forever |
| 9 | Valentine's Day |
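
As a rough illustration of the difference between a Series and a one-column DataFrame, the sketch below compares `to_frame()` with selecting a list containing a single label. The two-row frame `sample` is invented for the example.

```python
import pandas as pd

# Hypothetical two-row frame
sample = pd.DataFrame({"Film": ["WALL-E", "Waitress"], "Year": [2008, 2007]})

film = sample["Film"]          # a string label selects a Series
as_frame = film.to_frame()     # Series converted to a one-column DataFrame
also_frame = sample[["Film"]]  # a list of labels selects a DataFrame directly

print(film.ndim, as_frame.ndim)     # number of dimensions of each object
print(as_frame.equals(also_frame))  # both routes give the same frame
```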


## Calling Series methods

Count the total number of attributes and methods of a Series and of a DataFrame

In [5]:
series_attr_methods = set(dir(pd.Series))
len(series_attr_methods)

464

In [4]:
dataframe_attr_methods = set(dir(pd.DataFrame))
len(dataframe_attr_methods)

460

Find the number of attributes and methods common to both Series and DataFrame

In [5]:
len(series_attr_methods & dataframe_attr_methods)

399
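
The same set arithmetic also shows which names belong to only one of the two classes. This is a sketch rather than a fixed result: the exact counts depend on the installed pandas version, so no numbers are stated here.

```python
import pandas as pd

series_names = set(dir(pd.Series))
frame_names = set(dir(pd.DataFrame))

common = series_names & frame_names       # available on both classes
series_only = series_names - frame_names  # defined only on Series
frame_only = frame_names - series_names   # defined only on DataFrame

print(len(common), len(series_only), len(frame_only))

# A few public Series-only names, in alphabetical order
print(sorted(n for n in series_only if not n.startswith("_"))[:5])
```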

In [53]:
movie = pd.read_csv('data/movies.csv')

In [10]:
movie.dtypes

Film                  object
Genre                 object
Lead Studio           object
Audience score %       int64
Profitability        float64
Rotten Tomatoes %      int64
Worldwide Gross       object
Year                 float64
dtype: object

In [11]:
film = movie['Film']
profitability = movie['Profitability']

In [18]:
film.head()

0            Zack and Miri Make a Porno
1                       Youth in Revolt
2    You Will Meet a Tall Dark Stranger
3                          When in Rome
4                 What Happens in Vegas
Name: Film, dtype: object

In [19]:
profitability.head()

0    1.747542
1    1.090000
2    1.211818
3    0.000000
4    6.267647
Name: Profitability, dtype: float64

The dtype of a Series usually determines which methods will be the most useful.
For instance, one of the most useful methods for a Series with the **object** data type is **value_counts()**, which counts the occurrences of each unique value.

In [20]:
film.value_counts()

Gnomeo and Juliet                      2
Mamma Mia!                             2
One Day                                1
Midnight in Paris                      1
Over Her Dead Body                     1
Tangled                                1
Sex and the City                       1
Not Easily Broken                      1
Life as We Know It                     1
What Happens in Vegas                  1
Knocked Up                             1
License to Wed                         1
Marley and Me                          1
Love Happens                           1
The Proposal                           1
Our Family Wedding                     1
Waitress                               1
It's Complicated                       1
Ghosts of Girlfriends Past             1
Remember Me                            1
A Dangerous Method                     1
Twilight                               1
Twilight: Breaking Dawn                1
A Serious Man                          1
I Love You Phill

**value_counts()** is typically more useful for Series with the object data type, but it can be used for numeric Series as well. Note that it counts exact values without any rounding (the output only displays them rounded to six decimal places), so a continuous column such as Profitability yields mostly counts of 1.

In [21]:
profitability.value_counts()

0.000000     3
5.387972     2
2.883500     2
9.234454     2
14.196400    1
1.090000     1
1.783944     1
3.491250     1
2.642353     1
1.747542     1
3.647411     1
2.440500     1
3.746782     1
1.719514     1
1.211818     1
4.184038     1
2.129444     1
4.471875     1
1.817667     1
5.402632     1
3.307180     1
0.825800     1
3.724192     1
2.896019     1
1.797417     1
4.598800     1
1.751351     1
0.005000     1
7.867500     1
2.367685     1
            ..
10.180027    1
2.639333     1
11.089741    1
4.005737     1
8.096000     1
3.352729     1
1.384167     1
2.004444     1
5.103117     1
2.202571     1
8.744706     1
2.022925     1
6.636402     1
1.715263     1
0.252895     1
2.598205     1
22.913136    1
1.314062     1
66.934000    1
0.652603     1
2.536429     1
2.649068     1
1.980206     1
1.983200     1
1.340000     1
6.267647     1
3.207850     1
3.081421     1
1.382800     1
5.343622     1
Name: Profitability, Length: 72, dtype: int64
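
For a continuous numeric column, grouping the values into intervals is often more informative than counting exact values. The sketch below uses the `bins` and `normalize` parameters of `value_counts()` on a made-up Series; the seven numbers are illustrative and not taken from the full dataset.

```python
import pandas as pd

# Hypothetical profitability values; only 0.0 appears more than once
profit = pd.Series([0.0, 0.0, 1.09, 1.21, 2.64, 6.27, 11.09], name="Profitability")

exact = profit.value_counts()                     # one row per distinct value
binned = profit.value_counts(bins=3, sort=False)  # three equal-width intervals
shares = profit.value_counts(normalize=True)      # relative frequencies

print(exact)
print(binned)
print(shares)
```

With `sort=False` the intervals are listed in ascending order instead of by frequency.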

### Return the size of a Series

In [22]:
film.size

77

In [23]:
film.shape

(77,)

In [24]:
len(film)

77

**count()** returns the number of non-missing values

In [25]:
film.count()

77

In [26]:
movie['Year'].count()

76

In [27]:
profitability.count()

77
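
The difference between the size of a Series and its count only appears when values are missing. A minimal sketch, assuming a hypothetical four-element Series with a single missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
year = pd.Series([2008.0, 2010.0, np.nan, 2011.0], name="Year")

# size, shape and len() all include the missing entry
print(year.size, year.shape, len(year))

# count() skips missing values
print(year.count())

# The difference is the number of missing values
print(year.size - year.count())
```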

Several expressions separated by **commas** are evaluated together and returned as a **tuple**, which is a compact way to show multiple statistics in one cell

In [33]:
profitability.min(), profitability.max(), profitability.mean(), profitability.median(), profitability.std(), \
profitability.sum()

(0.0,
 66.934,
 4.5994833979610386,
 2.642352941,
 8.031990152409822,
 354.160221643)

The **describe()** method outputs most of these statistics at once, together with the count and the quartiles (the sum is not included)

In [34]:
profitability.describe()

count    77.000000
mean      4.599483
std       8.031990
min       0.000000
25%       1.751351
50%       2.642353
75%       5.103117
max      66.934000
Name: Profitability, dtype: float64
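
The result of `describe()` is itself a Series, so individual statistics can be looked up by label. The sketch below checks a few of them against the dedicated methods on an invented five-value Series, and uses the `percentiles` parameter to request other cut points.

```python
import pandas as pd

# Hypothetical profitability values
profit = pd.Series([0.0, 1.09, 1.21, 2.64, 6.27], name="Profitability")

summary = profit.describe()
print(summary)

# Entries of the summary match the dedicated methods
print(summary["min"] == profit.min())
print(summary["50%"] == profit.median())
print(summary["count"] == profit.count())

# Other percentiles can be requested explicitly
wide = profit.describe(percentiles=[0.1, 0.9])
print(wide.index.tolist())
```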

**isnull()** determines if each value is missing or not

In [36]:
movie['Year'].isnull()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
      ...  
47    False
48    False
49    False
50    False
51    False
52    False
53    False
54    False
55    False
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
64    False
65    False
66    False
67    False
68    False
69    False
70    False
71    False
72    False
73    False
74    False
75    False
76     True
Name: Year, Length: 77, dtype: bool
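
Scanning a long boolean output by eye is error-prone, so the mask is usually summarised instead. A small sketch on a hypothetical four-element Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value at position 2
year = pd.Series([2008.0, 2010.0, np.nan, 2011.0], name="Year")

mask = year.isnull()  # isna() is an alias with the same behaviour

print(mask.sum())                 # True counts as 1, so this is the number missing
print(mask.any())                 # whether at least one value is missing
print(year[mask].index.tolist())  # index labels of the missing entries
```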

It is possible to replace all missing values within a Series with the **fillna()** method, here using 0 as the replacement value

In [55]:
year = movie['Year'].fillna(0)
year.count()

77

On the other hand, to remove the entries with missing values from a Series, use the **dropna()** method

In [56]:
year = movie['Year'].dropna()
year.size

76
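
The two strategies can be compared side by side on a small example. The sketch below assumes a hypothetical four-element Series with one missing value; both methods return a new Series and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value at position 2
year = pd.Series([2008.0, 2010.0, np.nan, 2011.0], name="Year")

filled = year.fillna(0)  # same length, missing entry replaced by 0
dropped = year.dropna()  # shorter, missing entry removed

print(filled.size, filled.count())
print(dropped.size, dropped.index.tolist())

# The original Series still contains its missing value
print(year.isnull().sum())
```

Note that `dropna()` keeps the original index labels, so the result has a gap where the missing entry used to be.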