# Pandas Foundations

## Dissecting the anatomy of a DataFrame

In [2]:
import pandas as pd
import numpy as np

The **read_csv()** function reads a CSV file and converts it into a DataFrame

In [3]:
movie = pd.read_csv('data/movies.csv')
movie.head()

| | Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 |
| 1 | Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.09 | 68 | $19.62 | 2010.0 |
| 2 | You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 |
| 3 | When in Rome | Comedy | Disney | 44 | 0.0 | 15 | $43.04 | 2010.0 |
| 4 | What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 |


## Accessing the main DataFrame components

In [8]:
columns = movie.columns
index = movie.index
data = movie.values

In [16]:
columns

Index(['Film', 'Genre', 'Lead Studio', 'Audience score %', 'Profitability',
       'Rotten Tomatoes %', 'Worldwide Gross', 'Year'],
      dtype='object')

In [17]:
index

RangeIndex(start=0, stop=77, step=1)

In [20]:
data

array([['Zack and Miri Make a Porno', 'Romance', 'The Weinstein Company',
        70, 1.747541667, 64, '$41.94 ', 2008.0],
       ['Youth in Revolt', 'Comedy', 'The Weinstein Company', 52, 1.09,
        68, '$19.62 ', 2010.0],
       ['You Will Meet a Tall Dark Stranger', 'Comedy', 'Independent',
        35, 1.211818182, 43, '$26.66 ', 2010.0],
       ['When in Rome', 'Comedy', 'Disney', 44, 0.0, 15, '$43.04 ',
        2010.0],
       ['What Happens in Vegas', 'Comedy', 'Fox', 72, 6.267647029, 28,
        '$219.37 ', 2008.0],
       ['Water For Elephants', 'Drama', '20th Century Fox', 72,
        3.081421053, 60, '$117.09 ', 2011.0],
       ['WALL-E', 'Animation', 'Disney', 89, 2.8960190669999997, 96,
        '$521.28 ', 2008.0],
       ['Waitress', 'Romance', 'Independent', 67, 11.089741499999999, 89,
        '$22.18 ', 2007.0],
       ['Waiting For Forever', 'Romance', 'Independent', 53, 0.005, 6,
        '$0.03 ', 2011.0],
       ["Valentine's Day", 'Comedy', 'Warner Bros.', 54, 4.1

**type()** gives the type of a variable

In [22]:
type(index)

pandas.core.indexes.range.RangeIndex

In [23]:
type(columns)

pandas.core.indexes.base.Index

In [24]:
type(data)

numpy.ndarray

**issubclass(first, second)** checks whether the first argument is a subclass of the second argument

In [25]:
issubclass(pd.RangeIndex, pd.Index)

True
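As a minimal sketch of the same inspection without `data/movies.csv`, the block below builds a tiny DataFrame and pulls out its three components. The name `toy` and its values are made up for illustration. It uses `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the movies data
toy = pd.DataFrame({
    'Film': ['A', 'B', 'C'],
    'Year': [2008.0, 2010.0, np.nan],
})

columns = toy.columns   # column labels
index = toy.index       # row labels
data = toy.to_numpy()   # underlying values as a NumPy array

print(type(columns).__name__)       # Index
print(type(index).__name__)         # RangeIndex
print(type(data).__name__)          # ndarray
print(isinstance(index, pd.Index))  # True, because RangeIndex subclasses Index
```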

## Understanding data types

In [26]:
movie = pd.read_csv('data/movies.csv')

The **dtypes** attribute displays each column along with its data type

In [27]:
movie.dtypes

Film                  object
Genre                 object
Lead Studio           object
Audience score %       int64
Profitability        float64
Rotten Tomatoes %      int64
Worldwide Gross       object
Year                 float64
dtype: object

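The **get_dtype_counts()** method gives the count of each data type present in the DataFrame. It was deprecated in pandas 0.25 and removed in pandas 1.0; on newer versions, `movie.dtypes.value_counts()` returns the same counts.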

In [28]:
movie.get_dtype_counts()

float64    2
int64      2
object     4
dtype: int64
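Because `dtypes` is itself a Series, the same tally can be obtained with `value_counts()`. The sketch below assumes a small made-up frame named `toy`. The exact dtype names printed may differ between pandas versions.

```python
import pandas as pd

# Hypothetical frame with two text columns, one integer and one float column
toy = pd.DataFrame({
    'Film': ['A', 'B'],
    'Genre': ['Comedy', 'Drama'],
    'Score': [70, 52],
    'Profitability': [1.75, 1.09],
})

# dtypes is a Series of dtype objects, so value_counts() tallies each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```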

## Series

There are two different syntaxes to select a Series:

- *Index operator*

- *Dot notation*. However, this notation **should be avoided** because it fails in some cases, such as column names that collide with DataFrame methods or that contain spaces or special characters.

A **Series** is a single column of data from a DataFrame. It is a **single dimension** of data, composed of just an index and the data.
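The two failure cases of dot notation can be sketched with a made-up frame. The column names `count` and `Lead Studio` are chosen here only to trigger the problems.

```python
import pandas as pd

# Hypothetical frame: one column collides with a method, one contains a space
toy = pd.DataFrame({
    'count': [1, 2, 3],
    'Lead Studio': ['X', 'Y', 'Z'],
})

# The index operator always returns the column
by_index = toy['count']
print(type(by_index).__name__)  # Series

# Dot notation resolves to the DataFrame.count method, not the column
by_dot = toy.count
print(callable(by_dot))         # True

# `toy.Lead Studio` is not valid Python, so only the index operator works
studio = toy['Lead Studio']
print(studio.tolist())          # ['X', 'Y', 'Z']
```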

### Selecting a single column of data as a Series

Using **index operator**

In [30]:
movie['Film']

0              Zack and Miri Make a Porno
1                         Youth in Revolt
2      You Will Meet a Tall Dark Stranger
3                            When in Rome
4                   What Happens in Vegas
5                     Water For Elephants
6                                  WALL-E
7                                Waitress
8                     Waiting For Forever
9                         Valentine's Day
10    Tyler Perry's Why Did I get Married
11                Twilight: Breaking Dawn
12                               Twilight
13                         The Ugly Truth
14            The Twilight Saga: New Moon
15               The Time Traveler's Wife
16                           The Proposal
17                 The Invention of Lying
18                     The Heartbreak Kid
19                            The Duchess
20    The Curious Case of Benjamin Button
21                       The Back-up Plan
22                                Tangled
23                     Something B

Using **dot notation**

In [31]:
movie.Film

0              Zack and Miri Make a Porno
1                         Youth in Revolt
2      You Will Meet a Tall Dark Stranger
3                            When in Rome
4                   What Happens in Vegas
5                     Water For Elephants
6                                  WALL-E
7                                Waitress
8                     Waiting For Forever
9                         Valentine's Day
10    Tyler Perry's Why Did I get Married
11                Twilight: Breaking Dawn
12                               Twilight
13                         The Ugly Truth
14            The Twilight Saga: New Moon
15               The Time Traveler's Wife
16                           The Proposal
17                 The Invention of Lying
18                     The Heartbreak Kid
19                            The Duchess
20    The Curious Case of Benjamin Button
21                       The Back-up Plan
22                                Tangled
23                     Something B

In [32]:
type(movie['Film'])

pandas.core.series.Series

In [33]:
type(movie.Film)

pandas.core.series.Series

A Series can be converted into a one-column DataFrame using the **to_frame()** method

In [34]:
film = movie['Film']
film.to_frame()

| | Film |
|---|---|
| 0 | Zack and Miri Make a Porno |
| 1 | Youth in Revolt |
| 2 | You Will Meet a Tall Dark Stranger |
| 3 | When in Rome |
| 4 | What Happens in Vegas |
| 5 | Water For Elephants |
| 6 | WALL-E |
| 7 | Waitress |
| 8 | Waiting For Forever |
| 9 | Valentine's Day |


## Calling Series methods

Count the attributes and methods of an object

In [6]:
series_attr_methods = set(dir(pd.Series))
len(series_attr_methods)

464

In [4]:
dataframe_attr_methods = set(dir(pd.DataFrame))
len(dataframe_attr_methods)

460

Find the number of attributes and methods common to both Series and DataFrame

In [5]:
len(series_attr_methods & dataframe_attr_methods)

399

In [53]:
movie = pd.read_csv('data/movies.csv')

In [10]:
movie.dtypes

Film                  object
Genre                 object
Lead Studio           object
Audience score %       int64
Profitability        float64
Rotten Tomatoes %      int64
Worldwide Gross       object
Year                 float64
dtype: object

In [11]:
film = movie['Film']
profitability = movie['Profitability']

In [18]:
film.head()

0            Zack and Miri Make a Porno
1                       Youth in Revolt
2    You Will Meet a Tall Dark Stranger
3                          When in Rome
4                 What Happens in Vegas
Name: Film, dtype: object

In [19]:
profitability.head()

0    1.747542
1    1.090000
2    1.211818
3    0.000000
4    6.267647
Name: Profitability, dtype: float64

The dtype of a Series usually determines which methods will be the most useful.
For instance, one of the most useful methods for an **object** dtype Series is **value_counts()**, which counts the occurrences of each unique value.

In [20]:
film.value_counts()

Gnomeo and Juliet                      2
Mamma Mia!                             2
One Day                                1
Midnight in Paris                      1
Over Her Dead Body                     1
Tangled                                1
Sex and the City                       1
Not Easily Broken                      1
Life as We Know It                     1
What Happens in Vegas                  1
Knocked Up                             1
License to Wed                         1
Marley and Me                          1
Love Happens                           1
The Proposal                           1
Our Family Wedding                     1
Waitress                               1
It's Complicated                       1
Ghosts of Girlfriends Past             1
Remember Me                            1
A Dangerous Method                     1
Twilight                               1
Twilight: Breaking Dawn                1
A Serious Man                          1
I Love You Phill

**value_counts()** is typically more useful for Series with the object data type, but it works on numeric Series as well. Note that the values shown in the output are rounded for display (six decimal places by default); the counts themselves are based on the exact stored values.

In [21]:
profitability.value_counts()

0.000000     3
5.387972     2
2.883500     2
9.234454     2
14.196400    1
1.090000     1
1.783944     1
3.491250     1
2.642353     1
1.747542     1
3.647411     1
2.440500     1
3.746782     1
1.719514     1
1.211818     1
4.184038     1
2.129444     1
4.471875     1
1.817667     1
5.402632     1
3.307180     1
0.825800     1
3.724192     1
2.896019     1
1.797417     1
4.598800     1
1.751351     1
0.005000     1
7.867500     1
2.367685     1
            ..
10.180027    1
2.639333     1
11.089741    1
4.005737     1
8.096000     1
3.352729     1
1.384167     1
2.004444     1
5.103117     1
2.202571     1
8.744706     1
2.022925     1
6.636402     1
1.715263     1
0.252895     1
2.598205     1
22.913136    1
1.314062     1
66.934000    1
0.652603     1
2.536429     1
2.649068     1
1.980206     1
1.983200     1
1.340000     1
6.267647     1
3.207850     1
3.081421     1
1.382800     1
5.343622     1
Name: Profitability, Length: 72, dtype: int64
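A small sketch of how `value_counts()` treats floats, using made-up numbers. Two values that differ only in the seventh decimal place are counted as distinct, even though they may print alike.

```python
import pandas as pd

# Hypothetical values: the first two differ only in the seventh decimal place
s = pd.Series([0.1234561, 0.1234562, 0.5, 0.5])

counts = s.value_counts()
print(counts)
print(len(counts))  # 3 distinct values, not 2
```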

### Return the size of a Series

In [22]:
film.size

77

In [23]:
film.shape

(77,)

In [24]:
len(film)

77

**count()** returns the number of non-missing values

In [25]:
film.count()

77

In [26]:
movie['Year'].count()

76

In [27]:
profitability.count()

77
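The difference between the size of a Series and its count only shows up when values are missing. A minimal sketch with a made-up three-element Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
year = pd.Series([2008.0, 2010.0, np.nan], name='Year')

print(year.size)     # 3, every element is counted
print(len(year))     # 3
print(year.shape)    # (3,)
print(year.count())  # 2, the missing value is ignored
```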

Writing several expressions separated by **commas** returns a **tuple**

In [33]:
profitability.min(), profitability.max(), profitability.mean(), profitability.median(), profitability.std(), \
profitability.sum()

(0.0,
 66.934,
 4.5994833979610386,
 2.642352941,
 8.031990152409822,
 354.160221643)

The **describe()** method outputs most of the same summary statistics at once (the median appears as the 50% row; the sum is not included)

In [34]:
profitability.describe()

count    77.000000
mean      4.599483
std       8.031990
min       0.000000
25%       1.751351
50%       2.642353
75%       5.103117
max      66.934000
Name: Profitability, dtype: float64
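To see how the rows of `describe()` line up with the individual methods, here is a sketch on a made-up Series named `profit`:

```python
import pandas as pd

# Hypothetical values chosen so the statistics are easy to check by hand
profit = pd.Series([0.0, 1.0, 2.0, 3.0, 14.0], name='Profitability')

summary = profit.describe()
print(summary['50%'] == profit.median())  # True, the 50% row is the median
print(summary['mean'] == profit.mean())   # True
print('sum' in summary.index)             # False, the sum is computed separately
print(profit.sum())                       # 20.0
```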

**isnull()** determines if each value is missing or not

In [36]:
movie['Year'].isnull()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
      ...  
47    False
48    False
49    False
50    False
51    False
52    False
53    False
54    False
55    False
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
64    False
65    False
66    False
67    False
68    False
69    False
70    False
71    False
72    False
73    False
74    False
75    False
76     True
Name: Year, Length: 77, dtype: bool

It is possible to replace all missing values within a Series with the **fillna()** method, here using 0 as the replacement value

In [55]:
year = movie['Year'].fillna(0)
year.count()

77

On the other hand, to remove the missing values from a Series, use the **dropna()** method

In [8]:
year = movie['Year'].dropna()
year.size

76
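The two approaches side by side, as a sketch on a made-up Series. Both methods return a new Series and leave the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value in the middle
year = pd.Series([2008.0, np.nan, 2010.0], name='Year')

filled = year.fillna(0)  # missing value replaced by 0, length unchanged
dropped = year.dropna()  # missing value removed, length shrinks

print(filled.tolist())         # [2008.0, 0.0, 2010.0]
print(dropped.tolist())        # [2008.0, 2010.0]
print(dropped.index.tolist())  # [0, 2], the original labels are kept
print(year.isnull().sum())     # 1, the original Series still has its NaN
```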

## Working with operators on a Series

In [9]:
5+9

14

In [10]:
4**2

16

Concatenating strings

In [11]:
'abc' + 'def'

'abcdef'

In [12]:
not (5 <= 9)

False

In [21]:
7 in [1, 2, 3]

False

Use the **&** operator to find the elements common to two sets

In [27]:
set([1, 2, 3]) & set([2, 3, 4])

{2, 3}

In [35]:
movie = pd.read_csv('data/movies.csv')
year = movie['Year']
year

0     2008.0
1     2010.0
2     2010.0
3     2010.0
4     2008.0
5     2011.0
6     2008.0
7     2007.0
8     2011.0
9     2010.0
10    2007.0
11    2011.0
12    2008.0
13    2009.0
14    2009.0
15    2009.0
16    2009.0
17    2009.0
18    2007.0
19    2008.0
20    2008.0
21    2010.0
22    2010.0
23    2011.0
24    2010.0
25    2010.0
26    2010.0
27    2008.0
28    2010.0
29    2008.0
       ...  
47    2008.0
48    2009.0
49    2010.0
50    2010.0
51    2007.0
52    2010.0
53    2010.0
54    2007.0
55    2010.0
56    2010.0
57    2011.0
58    2009.0
59    2010.0
60    2008.0
61    2009.0
62    2007.0
63    2010.0
64    2011.0
65    2011.0
66    2009.0
67    2008.0
68    2008.0
69    2007.0
70    2010.0
71    2011.0
72    2007.0
73    2009.0
74    2011.0
75    2008.0
76       NaN
Name: Year, Length: 77, dtype: float64

In [36]:
year + 1

0     2009.0
1     2011.0
2     2011.0
3     2011.0
4     2009.0
5     2012.0
6     2009.0
7     2008.0
8     2012.0
9     2011.0
10    2008.0
11    2012.0
12    2009.0
13    2010.0
14    2010.0
15    2010.0
16    2010.0
17    2010.0
18    2008.0
19    2009.0
20    2009.0
21    2011.0
22    2011.0
23    2012.0
24    2011.0
25    2011.0
26    2011.0
27    2009.0
28    2011.0
29    2009.0
       ...  
47    2009.0
48    2010.0
49    2011.0
50    2011.0
51    2008.0
52    2011.0
53    2011.0
54    2008.0
55    2011.0
56    2011.0
57    2012.0
58    2010.0
59    2011.0
60    2009.0
61    2010.0
62    2008.0
63    2011.0
64    2012.0
65    2012.0
66    2010.0
67    2009.0
68    2009.0
69    2008.0
70    2011.0
71    2012.0
72    2008.0
73    2010.0
74    2012.0
75    2009.0
76       NaN
Name: Year, Length: 77, dtype: float64

In [37]:
year * 2.5

0     5020.0
1     5025.0
2     5025.0
3     5025.0
4     5020.0
5     5027.5
6     5020.0
7     5017.5
8     5027.5
9     5025.0
10    5017.5
11    5027.5
12    5020.0
13    5022.5
14    5022.5
15    5022.5
16    5022.5
17    5022.5
18    5017.5
19    5020.0
20    5020.0
21    5025.0
22    5025.0
23    5027.5
24    5025.0
25    5025.0
26    5025.0
27    5020.0
28    5025.0
29    5020.0
       ...  
47    5020.0
48    5022.5
49    5025.0
50    5025.0
51    5017.5
52    5025.0
53    5025.0
54    5017.5
55    5025.0
56    5025.0
57    5027.5
58    5022.5
59    5025.0
60    5020.0
61    5022.5
62    5017.5
63    5025.0
64    5027.5
65    5027.5
66    5022.5
67    5020.0
68    5020.0
69    5017.5
70    5025.0
71    5027.5
72    5017.5
73    5022.5
74    5027.5
75    5020.0
76       NaN
Name: Year, Length: 77, dtype: float64

The **//** operator performs floor division (the result is rounded down to a whole number), whereas **/** performs true division and keeps the fractional part

In [42]:
year // 7

0     286.0
1     287.0
2     287.0
3     287.0
4     286.0
5     287.0
6     286.0
7     286.0
8     287.0
9     287.0
10    286.0
11    287.0
12    286.0
13    287.0
14    287.0
15    287.0
16    287.0
17    287.0
18    286.0
19    286.0
20    286.0
21    287.0
22    287.0
23    287.0
24    287.0
25    287.0
26    287.0
27    286.0
28    287.0
29    286.0
      ...  
47    286.0
48    287.0
49    287.0
50    287.0
51    286.0
52    287.0
53    287.0
54    286.0
55    287.0
56    287.0
57    287.0
58    287.0
59    287.0
60    286.0
61    287.0
62    286.0
63    287.0
64    287.0
65    287.0
66    287.0
67    286.0
68    286.0
69    286.0
70    287.0
71    287.0
72    286.0
73    287.0
74    287.0
75    286.0
76      NaN
Name: Year, Length: 77, dtype: float64
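A short sketch contrasting the two division operators on a made-up Series. Missing values stay missing under both.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
year = pd.Series([2008.0, 2010.0, np.nan], name='Year')

true_div = year / 7    # true division keeps the fractional part
floor_div = year // 7  # floor division rounds down to a whole number

print(true_div.round(3).tolist()[:2])  # [286.857, 287.143]
print(floor_div.tolist()[:2])          # [286.0, 287.0]
print(floor_div.isnull().tolist())     # [False, False, True]
```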

In [52]:
year > 2009

0     False
1      True
2      True
3      True
4     False
5      True
6     False
7     False
8      True
9      True
10    False
11     True
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21     True
22     True
23     True
24     True
25     True
26     True
27    False
28     True
29    False
      ...  
47    False
48    False
49     True
50     True
51    False
52     True
53     True
54    False
55     True
56     True
57     True
58    False
59     True
60    False
61    False
62    False
63     True
64     True
65     True
66    False
67    False
68    False
69    False
70     True
71     True
72    False
73    False
74     True
75    False
76    False
Name: Year, Length: 77, dtype: bool

In [56]:
genre = movie['Genre']
genre == 'Animation'

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22     True
23    False
24    False
25    False
26    False
27    False
28    False
29    False
      ...  
47    False
48    False
49    False
50    False
51    False
52    False
53    False
54    False
55    False
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
64     True
65     True
66    False
67    False
68    False
69    False
70    False
71    False
72    False
73    False
74    False
75    False
76    False
Name: Genre, Length: 77, dtype: bool
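Comparison operators applied to a Series return a Boolean Series of the same length. The sketch below uses made-up data. A missing value compares as False, which matches the last row of the `year > 2009` output above.

```python
import numpy as np
import pandas as pd

# Hypothetical Series standing in for the Year and Genre columns
year = pd.Series([2008.0, 2010.0, 2011.0, np.nan], name='Year')
genre = pd.Series(['Comedy', 'Animation', 'Drama', 'Animation'], name='Genre')

after_2009 = year > 2009             # NaN compares as False
is_animation = genre == 'Animation'  # element-wise string equality

print(after_2009.tolist())    # [False, True, True, False]
print(is_animation.tolist())  # [False, True, False, True]
```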

## Chaining Series methods together

In [68]:
movie = pd.read_csv('data/movies.csv')
year = movie['Year']

Python treats False as 0 and True as 1, so summing the Boolean Series returned by **isnull()** gives the number of missing values.

In [58]:
year.isnull().sum()

1
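The chain can be broken into its two steps to see what each one returns. This is a sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values
year = pd.Series([2008.0, np.nan, 2010.0, np.nan], name='Year')

missing = year.isnull()    # step 1: a Boolean Series
n_missing = missing.sum()  # step 2: True counts as 1, False as 0

print(missing.tolist())  # [False, True, False, True]
print(n_missing)         # 2
print(True + True)       # 2, the same rule in plain Python
```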

In [61]:
year.dtype

dtype('float64')

The dtype of the Year column is float64, but we want it to be an integer dtype. To convert a Series to another dtype, use the **astype()** method. The NaN values must first be filled with **fillna()**, because NaN cannot be converted to an integer.

In [71]:
year = year.fillna(0).astype(int).head()
year

0    2008
1    2010
2    2010
3    2010
4    2008
Name: Year, dtype: int64
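The sketch below, on a made-up Series, shows why the order of the chain matters. Casting directly fails, while filling first succeeds. Note that the 0 used as a fill value is only a placeholder, not a real year.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
year = pd.Series([2008.0, 2010.0, np.nan], name='Year')

try:
    year.astype(int)
    cast_failed = False
except ValueError:
    # NaN has no integer representation, so the direct cast is rejected
    cast_failed = True

as_int = year.fillna(0).astype(int)  # fill first, then cast

print(cast_failed)      # True
print(as_int.tolist())  # [2008, 2010, 0]
```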

## Making the index meaningful

In [74]:
movie = pd.read_csv('data/movies.csv')
movie

| | Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 |
| 1 | Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 |
| 2 | You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 |
| 3 | When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 |
| 4 | What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 |
| 5 | Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 |
| 6 | WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 |
| 7 | Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 |
| 8 | Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 |
| 9 | Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 |


In [76]:
movie2 = movie.set_index('Film')
movie2

| Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|
| Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 |
| Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 |
| You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 |
| When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 |
| What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 |
| Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 |
| WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 |
| Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 |
| Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 |
| Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 |


The same result can be achieved directly when reading the file, by passing the **index_col** parameter to the *read_csv()* function

In [100]:
movie = pd.read_csv('data/movies.csv', index_col='Film')
movie

| Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|
| Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 |
| Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 |
| You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 |
| When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 |
| What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 |
| Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 |
| WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 |
| Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 |
| Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 |
| Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 |


In [101]:
movie.columns

Index(['Genre', 'Lead Studio', 'Audience score %', 'Profitability',
       'Rotten Tomatoes %', 'Worldwide Gross', 'Year'],
      dtype='object')

Conversely, the **reset_index()** method turns the index back into a column. Here it makes the Film index a column again.

In [104]:
movie.reset_index()

| | Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 |
| 1 | Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 |
| 2 | You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 |
| 3 | When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 |
| 4 | What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 |
| 5 | Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 |
| 6 | WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 |
| 7 | Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 |
| 8 | Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 |
| 9 | Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 |
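The round trip between a column and the index can be sketched on a made-up two-column frame named `toy`. Both methods return a new DataFrame rather than modifying the original.

```python
import pandas as pd

# Hypothetical frame standing in for the movies data
toy = pd.DataFrame({
    'Film': ['A', 'B', 'C'],
    'Genre': ['Comedy', 'Drama', 'Romance'],
})

indexed = toy.set_index('Film')   # 'Film' becomes the row labels
restored = indexed.reset_index()  # 'Film' becomes a regular column again

print(indexed.index.name)        # Film
print(indexed.columns.tolist())  # ['Genre']
print(restored.equals(toy))      # True
```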


## Renaming row and column names

In [109]:
movie = pd.read_csv('data/movies.csv', index_col='Film')

In [110]:
idx_rename = {'WALL-E':'Wall-e', 'Zack and Miri Make a Porno':'Zack and Miri make a porno'}
col_rename = {'Lead Studio':'The Lead Studio'}
movie.rename(index=idx_rename, columns=col_rename).head()

| Film | Genre | The Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|
| Zack and Miri make a porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 |
| Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.09 | 68 | $19.62 | 2010.0 |
| You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 |
| When in Rome | Comedy | Disney | 44 | 0.0 | 15 | $43.04 | 2010.0 |
| What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 |


Another way to rename index or column labels is to get them as lists and change the desired entries.

In [121]:
index_list = movie.index.tolist()
col_list = movie.columns.tolist()
col_list[1] = 'Lead Studio I changed you'

Then reassign the modified list to the DataFrame's **columns** (or **index**) attribute

In [124]:
movie.columns = col_list
movie

| | Genre | Lead Studio I changed you | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|
| Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 |
| Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 |
| You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 |
| When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 |
| What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 |
| Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 |
| WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 |
| Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 |
| Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 |
| Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 |
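The two renaming approaches differ in one important way, sketched below on a made-up frame. `rename()` returns a new DataFrame, whereas assigning to `.columns` modifies the original in place. The new label `Studio` is a placeholder chosen for this example.

```python
import pandas as pd

# Hypothetical frame indexed by film title
toy = pd.DataFrame(
    {'Genre': ['Comedy', 'Drama'], 'Lead Studio': ['X', 'Y']},
    index=pd.Index(['A', 'B'], name='Film'),
)

# rename() returns a new DataFrame and leaves the original untouched
renamed = toy.rename(index={'A': 'a'}, columns={'Lead Studio': 'Studio'})
cols_after_rename = toy.columns.tolist()
print(renamed.index.tolist())  # ['a', 'B']
print(cols_after_rename)       # ['Genre', 'Lead Studio']

# Assigning a list to .columns changes the original in place
col_list = toy.columns.tolist()
col_list[1] = 'Studio'
toy.columns = col_list
print(toy.columns.tolist())    # ['Genre', 'Studio']
```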


## Creating and deleting columns

In [3]:
movie = pd.read_csv('data/movies.csv')
movie['unseen_film'] = 0

In [4]:
movie.columns

Index(['Film', 'Genre', 'Lead Studio', 'Audience score %', 'Profitability',
       'Rotten Tomatoes %', 'Worldwide Gross', 'Year', 'unseen_film'],
      dtype='object')

In [128]:
movie['test_sum'] = (movie['Year'] + movie['Rotten Tomatoes %'])

In [5]:
movie

| | Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year | unseen_film |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 | 0 |
| 1 | Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 | 0 |
| 2 | You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 | 0 |
| 3 | When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 | 0 |
| 4 | What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 | 0 |
| 5 | Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 | 0 |
| 6 | WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 | 0 |
| 7 | Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 | 0 |
| 8 | Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 | 0 |
| 9 | Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 | 0 |


In [131]:
movie['test_sum'].isnull().sum()

1

In [132]:
movie['test_sum'] = movie['test_sum'].fillna(0)

In [133]:
movie['test_sum'].isnull().sum()

0

In [134]:
movie['is_greater'] = (movie['Audience score %'] >= movie['Rotten Tomatoes %'])

The **all()** method checks whether all elements are **True**

In [135]:
movie['is_greater'].all()

False
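Column creation and the `all()` check can be sketched on a made-up frame with three rows. `any()` is shown alongside as the complementary check.

```python
import pandas as pd

# Hypothetical scores for three films
toy = pd.DataFrame({
    'Audience score %': [70, 52, 35],
    'Rotten Tomatoes %': [64, 68, 43],
})

toy['unseen_film'] = 0  # a scalar is broadcast to every row
toy['is_greater'] = toy['Audience score %'] >= toy['Rotten Tomatoes %']

print(toy['is_greater'].tolist())  # [True, False, False]
print(toy['is_greater'].all())     # False, not every row is True
print(toy['is_greater'].any())     # True, at least one row is True
```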

The **drop()** method deletes the row with the specified index label

In [143]:
movie = movie.set_index('Film')
movie

| Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year | unseen_film |
|---|---|---|---|---|---|---|---|---|
| Zack and Miri Make a Porno | Romance | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 | 0 |
| Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 | 0 |
| You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 | 0 |
| When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 | 0 |
| What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 | 0 |
| Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 | 0 |
| WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 | 0 |
| Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 | 0 |
| Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 | 0 |
| Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 | 0 |


In [144]:
movie.drop('Zack and Miri Make a Porno')

| Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year | unseen_film |
|---|---|---|---|---|---|---|---|---|
| Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 | 0 |
| You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 | 0 |
| When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 | 0 |
| What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 | 0 |
| Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 | 0 |
| WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 | 0 |
| Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 | 0 |
| Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 | 0 |
| Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 | 0 |
| Tyler Perry's Why Did I get Married | Romance | Independent | 47 | 3.724192 | 46 | $55.86 | 2007.0 | 0 |


If the parameter **axis='columns'** (or **axis=1**) is passed, the column with that label is dropped instead. The default, **axis=0**, drops rows.

In [150]:
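# 'Film' was set as the index above; reset it so the output keeps the default integer index
movie.reset_index().drop('Genre', axis='columns')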

| | Film | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year | unseen_film |
|---|---|---|---|---|---|---|---|---|
| 0 | Zack and Miri Make a Porno | The Weinstein Company | 70 | 1.747542 | 64 | $41.94 | 2008.0 | 0 |
| 1 | Youth in Revolt | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 | 0 |
| 2 | You Will Meet a Tall Dark Stranger | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 | 0 |
| 3 | When in Rome | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 | 0 |
| 4 | What Happens in Vegas | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 | 0 |
| 5 | Water For Elephants | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 | 0 |
| 6 | WALL-E | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 | 0 |
| 7 | Waitress | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 | 0 |
| 8 | Waiting For Forever | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 | 0 |
| 9 | Valentine's Day | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 | 0 |


In [153]:
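# 'Film' was set as the index above, so label 0 only exists after restoring the default integer index
movie.reset_index().drop(0, axis=0)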

| | Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year | unseen_film |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Youth in Revolt | Comedy | The Weinstein Company | 52 | 1.090000 | 68 | $19.62 | 2010.0 | 0 |
| 2 | You Will Meet a Tall Dark Stranger | Comedy | Independent | 35 | 1.211818 | 43 | $26.66 | 2010.0 | 0 |
| 3 | When in Rome | Comedy | Disney | 44 | 0.000000 | 15 | $43.04 | 2010.0 | 0 |
| 4 | What Happens in Vegas | Comedy | Fox | 72 | 6.267647 | 28 | $219.37 | 2008.0 | 0 |
| 5 | Water For Elephants | Drama | 20th Century Fox | 72 | 3.081421 | 60 | $117.09 | 2011.0 | 0 |
| 6 | WALL-E | Animation | Disney | 89 | 2.896019 | 96 | $521.28 | 2008.0 | 0 |
| 7 | Waitress | Romance | Independent | 67 | 11.089741 | 89 | $22.18 | 2007.0 | 0 |
| 8 | Waiting For Forever | Romance | Independent | 53 | 0.005000 | 6 | $0.03 | 2011.0 | 0 |
| 9 | Valentine's Day | Comedy | Warner Bros. | 54 | 4.184038 | 17 | $217.57 | 2010.0 | 0 |
| 10 | Tyler Perry's Why Did I get Married | Romance | Independent | 47 | 3.724192 | 46 | $55.86 | 2007.0 | 0 |
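Dropping rows and dropping columns can be compared on a made-up frame indexed by film title. `drop()` returns a new DataFrame, so the original keeps its shape unless the result is assigned back.

```python
import pandas as pd

# Hypothetical frame indexed by film title
toy = pd.DataFrame(
    {'Genre': ['Comedy', 'Drama', 'Romance'], 'Year': [2008, 2010, 2011]},
    index=pd.Index(['A', 'B', 'C'], name='Film'),
)

no_row = toy.drop('A')                      # axis=0 by default: drops a row label
no_col = toy.drop('Genre', axis='columns')  # drops a column label

print(no_row.index.tolist())    # ['B', 'C']
print(no_col.columns.tolist())  # ['Year']
print(toy.shape)                # (3, 2), the original is unchanged
```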
