Pandas is a popular Python library that provides high-performance data 
structures and data analysis tools. It is widely used to transform 
raw data for data analysis and machine learning. We will learn: 

• data frames and data series 

• reading from files 

• data transformation 

• data visualization 

• statistical analysis

Wes McKinney developed Pandas and open sourced it in 2009. Later, Chang She became a major contributor. 

## Data frames and Data series

A Series is a one-dimensional labeled array that corresponds to one column 
in a table.

In [None]:
'''
First things first, let's import pandas
'''
import pandas as pd

Creating a data series from a list

In [None]:
list1 = ['Grapes', 'Apples', 'Oranges', 'Bananas']
s1 = pd.Series(list1)
print(s1)

Notice that the rows are numbered. These numbers are known as indices. By default, indices start at 0 and go up.  

We can provide a custom index as well. 

In [None]:
# defining list2 with indices
list2 = ['GR', 'AP', 'OR', 'BA']

# In series1 we say index=list2
series1 = pd.Series(list1, index=list2)
print(series1)

Now let us define a dictionary and convert it into a series.

In [None]:
d1 = {'Z': 'Zynga', 'U': 'Uber', 'G':'Google'}
ds1 = pd.Series(d1)
print(ds1)

Let's create a series with company name as index and its current stock price as 
value. Notice that in the code below, we have a dictionary with two keys 
having None as their values.

In [None]:
d2 = {'Amazon': 852, 'Nvidia': None, 'Alphabet': 856, 'Toyota': '112',
      'GE': 29, 'Ford': 12, 'Marriot': None, 'amazon': 1000}
companies = pd.Series(d2, name='Price')
print(companies)

Using the index we can get the corresponding stock prices. Below we are 
retrieving the stock price of Ford.

In [None]:
print(companies['Ford'])

We can also get stock prices for more companies. We have to supply the indices 
that we are interested in as a list.

In [None]:
print(companies[['Ford', 'GE']])
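
Square brackets look values up by index label. As a minimal sketch, the explicit accessors `.loc` (by label) and `.iloc` (by position) make the intent clearer. The `prices` series below is a small hypothetical example, not the `companies` series above.

```python
import pandas as pd

# Hypothetical stock prices, used only for illustration.
prices = pd.Series({'Ford': 12, 'GE': 29, 'Amazon': 852})

by_label = prices.loc['GE']                # look up by index label
by_position = prices.iloc[0]               # look up by integer position
subset = prices.loc[['Ford', 'Amazon']]    # a list of labels returns a series

print(by_label)
print(by_position)
print(subset)
```

Using `.loc` and `.iloc` avoids ambiguity when the index itself contains integers.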

Membership can be checked using the 'in' keyword.

In [None]:
print('Amazon' in companies)
print('Apple' in companies)
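
Note that `in` tests the index labels of a series, not its values. The sketch below illustrates the difference on a small hypothetical series named `prices`.

```python
import pandas as pd

# Hypothetical stock prices, used only for illustration.
prices = pd.Series({'Ford': 12, 'GE': 29})

label_found = 'Ford' in prices        # 'Ford' is an index label
value_as_label = 12 in prices         # 12 is a value, not an index label
value_found = 12 in prices.values     # search the underlying values instead

print(label_found, value_as_label, value_found)
```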

If we want to know the companies for which we don't have a stock price, we can 
use isnull(). It returns True for the indices that don't have a value and 
False for the indices that have a value.

In [None]:
print(companies.isnull())
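
The boolean result of isnull() can also be used to count the missing entries or to list their labels. This is a minimal sketch on a hypothetical `prices` series. The variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical prices with two missing entries.
prices = pd.Series({'Amazon': 852, 'Nvidia': None, 'GE': 29, 'Marriot': None})

missing_mask = prices.isnull()                    # True where the value is missing
missing_count = int(missing_mask.sum())           # True counts as 1
missing_names = list(prices[missing_mask].index)  # labels with no value

print(missing_count)
print(missing_names)
```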

A dataframe is a tabular data structure that consists of rows and columns. 
It can be thought of as a collection of series that share the same index.
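
To illustrate the relationship between the two structures, the sketch below builds a dataframe from two series and then pulls one column back out. The names `founded`, `price` and `frame` are hypothetical.

```python
import pandas as pd

# Two hypothetical series that share the same index labels.
founded = pd.Series({'Amazon': 1994, 'Ford': 1903})
price = pd.Series({'Amazon': 852.0, 'Ford': 12.5})

# Each series becomes one column of the dataframe.
frame = pd.DataFrame({'Founded': founded, 'Price': price})

# Selecting a single column gives back a series.
column = frame['Price']
is_series = isinstance(column, pd.Series)

print(frame)
print(is_series)
```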

Let's create a dataframe using a dictionary as shown below.

In [None]:
c1= {'Name': ['Amazon', 'GE', 'Toyota', 'Twitter', 'Ford', 'Marriot'],
            'Founded': [1994, 1923, 1937, 2006, 1903, 1927], 
             'Price': [852, 111.2, 112, 15.2, 12.5, 88.31]}

companies = pd.DataFrame(c1, columns=['Name', 'Founded', 'Price'])
print(companies)

In [None]:
c1= {'Name': ['Amazon', 'GE', 'Toyota', 'Twitter', 'Ford', 'Marriot'],
            'Founded': [1994, 1923, 1937, 2006, 1903, 1927], 
             'Price': [852, 111.2, 112, 15.2, 12.5, 88.31]}

companies = pd.DataFrame(c1)
print(companies)

In [None]:
"""
In-class activity: Create a data series that contains the names of 6 
US state capitals. Print the contents of the data series.
"""

In [None]:
"""
In-class activity: Add the state of each capital to the above data series 
as its index. 
"""

## Reading files

We can read a CSV file into a dataframe with read_csv(). 

In [None]:
# forward slashes work on every platform and avoid backslash escape problems
movies = pd.read_csv('alldata/imdb_movie/movie_metadata.csv')
print(movies.head())

In [None]:
print(movies.columns)

In [None]:
print(movies.dtypes)

In [None]:
print(movies.shape)

In [None]:
# for number of rows use shape[0]
print(movies.shape[0])

In [None]:
# for number of columns use shape[1]
print(movies.shape[1])

We can create a series from the movies dataframe. 

In [None]:
movies_dir = movies['director_name']
print(movies_dir.head())

Let's create a new dataframe with columns: 
movie_title, duration, budget, gross, genres, director_name.

In [None]:
newmovies = movies[['movie_title', 'duration', 'budget', 'gross',
                    'genres', 'director_name']]
print(newmovies.head())

In [None]:
print(newmovies.shape)

We can use sort_values() to sort a dataframe.

In [None]:
print(newmovies.sort_values('director_name').head())
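
sort_values() also accepts a list of columns and a matching list of sort directions. The sketch below uses a small hypothetical `films` dataframe rather than the movie dataset.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
films = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['Lee', 'Kim', 'Lee', 'Kim'],
    'gross': [300, 100, 500, 200],
})

# Sort by director (ascending), then by gross (descending) within each director.
ranked = films.sort_values(['director', 'gross'], ascending=[True, False])
print(ranked)
```

sort_values() returns a new dataframe and leaves the original unchanged.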

Notice that some values in our data frame are missing (NaN). We now want to drop 
rows that have NaN in any column. 

## Data Transformation

In [None]:
newmovies1 = newmovies.copy(deep=True)
newmovies1.dropna(how='any', inplace=True)
print(newmovies1.head())
print(newmovies1.shape)

In [None]:
newmovies2 = newmovies.copy(deep=True)
newmovies2.dropna(subset=['duration','budget'], how='any', inplace=True)

In [None]:
print(newmovies2.head())

In [None]:
print(newmovies2.shape)
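
The two dropna() calls above differ in which columns they inspect. As a minimal sketch on a hypothetical three-row `films` dataframe, the difference looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each numeric column.
films = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'duration': [95.0, np.nan, 150.0],
    'gross': [np.nan, 2.0, 3.0],
})

# Drop a row if any column is NaN.
all_columns = films.dropna(how='any')

# Drop a row only if 'duration' is NaN; NaN in other columns is kept.
duration_only = films.dropna(subset=['duration'])

print(all_columns)
print(duration_only)
```

Without inplace=True, dropna() returns a new dataframe, so no explicit copy is needed here.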

In [None]:
new_gross = newmovies[newmovies['gross']>350000]
print(new_gross.shape)
print(new_gross.head())
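
The cell above filters rows with a boolean mask: the comparison produces a series of True/False values, and indexing with it keeps only the True rows. The sketch below shows the mask on a hypothetical `films` dataframe and combines two conditions with `&`.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
films = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'gross': [500000, 200000, 900000],
    'duration': [95, 120, 150],
})

mask = films['gross'] > 350000      # boolean series, one entry per row
high_gross = films[mask]            # keeps rows where the mask is True

# Each condition needs its own parentheses when combined with & (and) or | (or).
long_and_high = films[(films['gross'] > 350000) & (films['duration'] > 100)]

print(mask)
print(high_gross)
print(long_and_high)
```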

## Statistical Analysis

In [None]:
print(newmovies1.describe())

We can count how many values in each column of newmovies are NaN using isnull().sum().

In [None]:
print(newmovies.isnull().sum())

In [None]:
newmovies3 = newmovies.copy(deep=True)
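# assign the result back; fillna(inplace=True) on a selected column may not update the frame in newer pandas versions
newmovies3['duration'] = newmovies3['duration'].fillna(value=90)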
print(newmovies3.isnull().sum())

In [None]:
"""
In-class activity: From the movies data frame, create a new data frame that 
consists of movie title, duration, budget and gross. 
1) Find the number of NaN in gross. 
2) Replace NaN in gross with the mean of gross.
"""

In [None]:
url= "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
countries=pd.read_csv(url)

In [None]:
print(countries.head())
print(countries.shape[0])

In [None]:
print(countries.describe())

## Data Visualization 

The line below makes sure that plots created by matplotlib are 
shown inside the Jupyter notebook.

In [None]:
%matplotlib inline

Let us plot a histogram of movie durations.

In [None]:
newmovies3.duration.plot(kind='hist')

Let's consider another dataset to understand different plotting choices. 

In [None]:
company = pd.read_csv('company.csv')
print(company.head())

In [None]:
company = pd.read_csv('company.csv')
company = company.set_index('Name')
print(company.head())

We plot a scatter plot between the columns sales_budget and marketing_budget.

In [None]:
company.plot(kind='scatter', x='sales_budget', y='marketing_budget')

Creating a series with column sales_budget.

In [None]:
sales = company['sales_budget']
print(sales)

In [None]:
sales_plot = sales.plot(kind='bar')
sales_plot.set_xlabel("Company Name")
sales_plot.set_ylabel("Sales")

In [None]:
# Series.plot ignores x and y; xlabel and ylabel (pandas 1.1+) set the axis labels
sales_plot = sales.plot(kind='bar', xlabel="Company Name", ylabel="Sales")

In [None]:
"""
In-class activity: Use the company data frame to create a data series 
of marketing budgets indexed by company name. 
1) Create a bar graph with company name on the x-axis and 
marketing budget on the y-axis. 
"""