# Pandas
## Introduction to Pandas

Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas incorporates two additional data structures into Python, namely Pandas Series and Pandas DataFrame. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.


In the following lessons you will learn:

* How to import Pandas
* How to create Pandas Series and DataFrames using various methods
* How to access and change elements in Series and DataFrames
* How to perform arithmetic operations on Series
* How to load data into a DataFrame
* How to deal with Not a Number (NaN) values

The following lessons assume that you are already familiar with NumPy and have gone over the previous NumPy lessons. Therefore, to avoid being repetitive we will omit a lot of details already given in the NumPy lessons. Consequently, if you haven't seen the NumPy lessons we suggest you go over them first.

#### Downloading Pandas

Pandas is included with Anaconda. If you don't already have Anaconda installed on your computer, please refer to the Anaconda section to get clear instructions on how to install Anaconda on your PC or Mac.


#### Pandas Versions
As with many Python packages, Pandas is updated from time to time. The following lessons were created using Pandas version 0.22. You can check which version of Pandas you have by typing `!conda list pandas` in your Jupyter notebook or by typing `conda list pandas` in the Anaconda prompt. If you have another version of Pandas installed on your computer, you can switch to this version by typing `conda install pandas=0.22` in the Anaconda prompt. As newer versions of Pandas are released, some functions may become obsolete or be replaced, so make sure you have the correct Pandas version before running the code. This helps ensure the code runs as shown.
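
The installed version can also be read from inside Python, which works whether or not Pandas was installed through conda. A minimal sketch; the variable name `installed_version` is only illustrative:

```python
import pandas as pd

# pd.__version__ holds the version string of the imported Pandas package
installed_version = pd.__version__
print('Pandas version:', installed_version)
```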


#### Pandas Documentation
Pandas is a remarkable data analysis library with many functions and features. In these introductory lessons we will only scratch the surface of what Pandas can do. If you want to learn more about Pandas, make sure you check out the Pandas Documentation:

Pandas Documentation

https://pandas.pydata.org/pandas-docs/stable/

In [None]:
!conda list pandas

### Why Use Pandas?

A few features that make Pandas an excellent package for data analysis:

* Allows the use of labels for rows and columns
* Calculates rolling statistics on time series data
* Handles NaN values easily
* Loads data of different formats into DataFrames
* Joins and merges different datasets
* Integrates with NumPy and Matplotlib


## Creating Pandas Series

In [None]:
import pandas as pd

#### pd.Series(data, index)

In [None]:
groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])

groceries

#### .shape , .ndim , .size , .index , .values

In [None]:
groceries.shape

In [None]:
groceries.ndim

In [None]:
groceries.size

In [None]:
groceries.index

In [None]:
groceries.values

#### in

In [None]:
'banana' in groceries

In [None]:
'bread' in groceries
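
Note that `in` checks the index labels of a Series, not its data. The sketch below rebuilds the `groceries` Series so that it runs on its own; checking against `.values` is one possible way to search the data instead. The variable names are only illustrative.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

label_found = 'bread' in groceries        # 'bread' is an index label
value_as_label = 'Yes' in groceries       # 'Yes' is data, not an index label
value_found = 'Yes' in groceries.values   # searches the data instead of the labels

print(label_found, value_as_label, value_found)
```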

### Accessing and Deleting Elements in Pandas Series

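Elements can be accessed using index labels or numerical indices inside square brackets.

Pandas Series have two attributes, *.loc* and *.iloc*, that make the kind of index explicit:

* *.loc* : access by index label
* *.iloc* : access by numerical (integer) position

To delete an item, use *.drop(label)*. By default it returns a new Series and leaves the original unchanged; pass *inplace = True* to modify the original Series.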
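
The access and delete operations described above can be sketched as follows. The example rebuilds the `groceries` Series so that it is self-contained; the variable names are only illustrative.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# Access by index label
eggs = groceries['eggs']
milk_and_bread = groceries.loc[['milk', 'bread']]

# Access by integer position
first_item = groceries.iloc[0]
last_item = groceries.iloc[-1]

# Change an element through its label
groceries['eggs'] = 2

# drop returns a new Series; the original still contains 'apples'
without_apples = groceries.drop('apples')

# With inplace=True the item is removed from the original Series
groceries.drop('apples', inplace=True)
print(groceries)
```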

### Arithmetic Operations on Pandas Series

In [None]:
fruits = pd.Series(data = [10, 6, 3], index = ['apples', 'oranges', 'bananas'])
fruits

In [None]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)

# We perform basic element-wise operations using arithmetic symbols
print()
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print()
print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
print()
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2 
print()
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2
print()

In [None]:
# We import NumPy as np to be able to use the mathematical functions
import numpy as np

# We print fruits for reference
print('Original grocery list of fruits:\n', fruits)

# We apply different mathematical functions to all elements of fruits
print()
print('EXP(X) = \n', np.exp(fruits))
print() 
print('SQRT(X) =\n', np.sqrt(fruits))
print()
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2

In [None]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)
print()

# We add 2 only to the bananas
print('Amount of bananas + 2 = ', fruits['bananas'] + 2)
print()

# We subtract 2 from apples
print('Amount of apples - 2 = ', fruits.iloc[0] - 2)
print()

# We multiply apples and oranges by 2
print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)
print()

# We divide apples and oranges by 2
print('We halve the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)

In [None]:
groceries * 2
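
Multiplication works on the mixed `groceries` Series because `*` is defined for both numbers and strings in Python (a string is repeated). An operation that is not defined for every element's type fails instead. A minimal sketch, rebuilding `groceries` so that it runs on its own; `doubled` and `division_failed` are illustrative names:

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# Numbers are doubled, strings are repeated twice
doubled = groceries * 2
print(doubled)

# Division is not defined for strings, so the whole operation raises an error
try:
    groceries / 2
    division_failed = False
except TypeError as error:
    division_failed = True
    print('Division failed:', error)
```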

### Quiz : Manipulate a Series

In [None]:
import pandas as pd

# Create a Pandas Series that contains the distance of some planets from the Sun.
# Use the name of the planets as the index to your Pandas Series, and the distance
# from the Sun as your data. The distance from the Sun is in units of 10^6 km

distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]

planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']

# Create a Pandas Series using the above data, with the name of the planets as
# the index and the distance from the Sun as your data.
dist_planets = pd.Series(data=distance_from_sun,index=planets)

# Calculate the number of minutes it takes sunlight to reach each planet. You can
# do this by dividing the distance from the Sun for each planet by the speed of light.
# Since in the data above the distance from the Sun is in units of 10^6 km, you can
# use a value for the speed of light of c = 18, since light travels 18 x 10^6 km/minute.
time_light = dist_planets/18

# Use Boolean indexing to select only those planets for which sunlight takes less
# than 40 minutes to reach them.
close_planets = time_light[time_light<40]
close_planets

## Creating Pandas DataFrames

#### pd.DataFrame()

In [2]:
# We import Pandas as pd into Python
import pandas as pd

# We create a dictionary of Pandas Series 
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# We print the type of items to see that it is a dictionary
print(type(items))

<class 'dict'>


In [None]:
# We create a Pandas DataFrame by passing it a dictionary of Pandas Series
shopping_carts = pd.DataFrame(items)

# We display the DataFrame
shopping_carts

In [None]:
# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)
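
When building a DataFrame from a dictionary of Series, the `columns` and `index` keywords of `pd.DataFrame` can be used to keep only part of the data. The sketch below reuses the `items` dictionary from above; `bob_cart` and `sel_cart` are illustrative names. Items that a person did not buy show up as NaN.

```python
import pandas as pd

items = {'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
         'Alice': pd.Series(data=[40, 110, 500, 45],
                            index=['book', 'glasses', 'bike', 'pants'])}

# Keep only Bob's column
bob_cart = pd.DataFrame(items, columns=['Bob'])

# Keep only the rows labeled 'pants' and 'book'
sel_cart = pd.DataFrame(items, index=['pants', 'book'])

print(bob_cart)
print()
print(sel_cart)
```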

In [3]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

# We display the DataFrame
store_items

         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10


## Accessing Elements in pandas DataFrames

When accessing elements in a DataFrame with chained square brackets, the column label comes first, followed by the row label.

In [4]:
store_items

         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10


In [5]:
print(store_items[['bikes']])
print()
print(store_items[['bikes','pants']])
print()
print(store_items.loc[['store 2']])
print()
print(store_items['bikes']['store 2'])

         bikes
store 1     20
store 2     15

         bikes  pants
store 1     20     30
store 2     15      5

         bikes  glasses  pants  watches
store 2     15     50.0      5       10

15


In [6]:
store_items['shirts']=[15,2]
store_items

         bikes  glasses  pants  watches  shirts
store 1     20      NaN     30       35      15
store 2     15     50.0      5       10       2


In [7]:
store_items['suites']=store_items['shirts']+store_items['pants']
store_items

         bikes  glasses  pants  watches  shirts  suites
store 1     20      NaN     30       35      15      45
store 2     15     50.0      5       10       2       7


#### .append(new_df)

In [8]:
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]
new_store = pd.DataFrame(new_items, index = ['store 3'])
new_items

[{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]

In [9]:
store_items = store_items.append(new_store)
store_items


         bikes  glasses  pants  shirts  suites  watches
store 1     20      NaN     30    15.0    45.0       35
store 2     15     50.0      5     2.0     7.0       10
store 3     20      4.0     30     NaN     NaN       35
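
`DataFrame.append` was deprecated in Pandas 1.4 and removed in Pandas 2.0, so the cell above only runs on older releases such as the 0.22 version used in these lessons. On newer versions, `pd.concat` is one way to add the rows of one DataFrame below another. A minimal sketch, using small stand-ins for `store_items` and `new_store` so that it runs on its own:

```python
import pandas as pd

store_items = pd.DataFrame([{'bikes': 20, 'pants': 30, 'watches': 35},
                            {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}],
                           index=['store 1', 'store 2'])
new_store = pd.DataFrame([{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}],
                         index=['store 3'])

# Stack the rows of new_store below store_items; columns are matched by label
store_items = pd.concat([store_items, new_store])
print(store_items)
```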


In [10]:
store_items['new_watches']=store_items['watches'][1:]
store_items

         bikes  glasses  pants  shirts  suites  watches  new_watches
store 1     20      NaN     30    15.0    45.0       35          NaN
store 2     15     50.0      5     2.0     7.0       10         10.0
store 3     20      4.0     30     NaN     NaN       35         35.0


#### .insert(loc,label,data)

In [11]:
store_items.insert(5,'shoes',[8,5,0])
store_items

         bikes  glasses  pants  shirts  suites  shoes  watches  new_watches
store 1     20      NaN     30    15.0    45.0      8       35          NaN
store 2     15     50.0      5     2.0     7.0      5       10         10.0
store 3     20      4.0     30     NaN     NaN      0       35         35.0


##### .pop(label) : deletes a column

##### .drop(labels, axis) : deletes rows (axis = 0) or columns (axis = 1)

In [12]:
store_items = store_items.drop(['store 1','store 2'],axis=0)
store_items

         bikes  glasses  pants  shirts  suites  shoes  watches  new_watches
store 3     20      4.0     30     NaN     NaN      0       35         35.0
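
The cell above only shows `.drop` with `axis=0`. The sketch below illustrates `.pop` and `.drop` with `axis=1` on a small stand-in DataFrame; the variable names `watches`, `no_pants` and `only_store_2` are only illustrative.

```python
import pandas as pd

store_items = pd.DataFrame({'bikes': [20, 15], 'pants': [30, 5], 'watches': [35, 10]},
                           index=['store 1', 'store 2'])

# pop removes one column from the DataFrame in place and returns it as a Series
watches = store_items.pop('watches')

# drop with axis=1 returns a new DataFrame without the listed columns
no_pants = store_items.drop(['pants'], axis=1)

# drop with axis=0 returns a new DataFrame without the listed rows
only_store_2 = store_items.drop(['store 1'], axis=0)

print(store_items)
```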


#### .rename(columns={original_name:new_name})

#### .rename(index={original_index:new_index})

#### .set_index('column')

In [13]:
store_items=store_items.rename(columns={'bikes':'hats'})
print(store_items)
print()
store_items=store_items.rename(index={'store 3':'last store'})
print(store_items)

         hats  glasses  pants  shirts  suites  shoes  watches  new_watches
store 3    20      4.0     30     NaN     NaN      0       35         35.0

            hats  glasses  pants  shirts  suites  shoes  watches  new_watches
last store    20      4.0     30     NaN     NaN      0       35         35.0


In [14]:
store_items=store_items.set_index('pants')
store_items

       hats  glasses  shirts  suites  shoes  watches  new_watches
pants
30       20      4.0     NaN     NaN      0       35         35.0


## Dealing with NaN

In [15]:
import pandas as pd
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

store_items

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


#### .isnull() : returns a Boolean DataFrame in which True indicates NaN
#### .count() : counts the non-NaN values in each column

In [17]:
x=store_items.isnull().sum().sum()
print(x)

3
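
The double `.sum()` above works in two steps: the first sum counts the NaN values in each column, and the second adds those counts together. A minimal sketch on a reduced version of `store_items`; the intermediate names are only illustrative.

```python
import numpy as np
import pandas as pd

store_items = pd.DataFrame({'bikes': [20, 15, 20],
                            'glasses': [np.nan, 50, 4],
                            'shirts': [15, 2, np.nan],
                            'suits': [45, 7, np.nan]},
                           index=['store 1', 'store 2', 'store 3'])

nan_mask = store_items.isnull()      # True wherever a value is NaN
nan_per_column = nan_mask.sum()      # True counts as 1, giving the NaN count per column
total_nan = nan_per_column.sum()     # adds the per-column counts together

print(nan_per_column)
print('Total number of NaN values:', total_nan)
```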


In [18]:
store_items.count()

bikes      3
glasses    2
pants      3
shirts     2
shoes      3
suits      2
watches    3
dtype: int64

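#### .dropna(axis = 0 or 1)

axis = 0 drops every row that contains NaN; axis = 1 drops every column that contains NaN. The original DataFrame is left unchanged unless inplace = True is passed.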

In [19]:
store_items.dropna(axis=0)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 2     15     50.0      5     2.0      5    7.0       10


In [20]:
store_items.dropna(axis=1)

         bikes  pants  shoes  watches
store 1     20     30      8       35
store 2     15      5      5       10
store 3     20     30     10       35


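#### .fillna(value) : replaces every NaN with the given value
#### .fillna(method = , axis = )

* ffill : forward filling, replaces each NaN with the previous value along the given axis
* backfill : backward filling, replaces each NaN with the next value along the given axis

#### .interpolate(method = , axis = )

* linear : replaces each NaN by linear interpolation between the neighboring values along the given axis

Both methods replace NaN out of place: they return a new DataFrame unless inplace = True is passed.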
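
The filling strategies above can be compared on a small stand-in DataFrame. This sketch uses the `.ffill()` and `.bfill()` methods rather than `fillna(method=...)`, since newer Pandas versions deprecate the `method` keyword of `fillna`; the result names are only illustrative.

```python
import numpy as np
import pandas as pd

store_items = pd.DataFrame({'bikes': [20, 15, 20],
                            'glasses': [np.nan, 50.0, 4.0],
                            'shirts': [15.0, 2.0, np.nan],
                            'watches': [10.0, np.nan, 30.0]},
                           index=['store 1', 'store 2', 'store 3'])

# Replace every NaN with a fixed value
filled_zero = store_items.fillna(0)

# Forward fill down each column: a NaN takes the value from the row above
forward = store_items.ffill(axis=0)

# Backward fill up each column: a NaN takes the value from the row below
backward = store_items.bfill(axis=0)

# Linear interpolation down each column: a NaN between two values
# becomes the point halfway between them
linear = store_items.interpolate(method='linear', axis=0)

print(linear)
```

Forward filling cannot fill a NaN in the first row, and backward filling cannot fill one in the last row, because there is no neighboring value to copy from.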

In [149]:
import pandas as pd
import numpy as np

# Since we will be working with ratings, we will set the precision of our 
# dataframes to one decimal place.
pd.set_option('display.precision', 1)

# Create a Pandas DataFrame that contains the ratings some users have given to a
# series of books. The ratings given are in the range from 1 to 5, with 5 being
# the best score. The names of the books, the authors, and the ratings of each user
# are given below:

books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', 'H. G. Wells', 'Lewis Carroll' ])

user_1 = pd.Series(data = [3.2, np.nan ,2.5])
user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])

# Users that have np.nan values means that the user has not yet rated that book.
# Use the data above to create a Pandas DataFrame that has the following column
# labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3', 'User 4'. Let Pandas
# automatically assign numerical row indices to the DataFrame. 

# Create a dictionary with the data given above
dat =dict(Author=authors,Book_Title=books,User_1=user_1,User_2=user_2,User_3=user_3,User_4=user_4)


# Use the dictionary to create a Pandas DataFrame
book_ratings = pd.DataFrame(dat)

# Keyword arguments to dict() cannot contain spaces, so we rename the columns
# afterwards to get the requested labels
book_ratings.rename(columns={'Book_Title':'Book Title','User_1':'User 1','User_2':'User 2','User_3':'User 3','User_4':'User 4'},inplace=True)

# If you created the dictionary correctly you should have a Pandas DataFrame
# that has column labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3',
# 'User 4' and row indices 0 through 4.

# Now replace all the NaN values in your DataFrame with the average rating in
# each column. Replace the NaN values in place. HINT: you can use the fillna()
# function with the keyword inplace = True, to do this. Write your code below:

# We call fillna on the whole DataFrame and pass the column means. This avoids
# in-place filling on a selected column, which newer Pandas versions warn about.
book_ratings.fillna(book_ratings.mean(numeric_only=True), inplace=True)





In [143]:
import pandas as pd
import numpy as np
pd.set_option('display.precision', 1)

books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', 'H. G. Wells', 'Lewis Carroll' ])
#labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3', 'User 4'
user_1 = pd.Series(data = [3.2, np.nan ,2.5])
user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])

In [144]:

dat = dict(Author=authors,Book_Title=books,User_1=user_1,User_2=user_2,User_3=user_3,User_4=user_4)
book_ratings = pd.DataFrame(dat)
book_ratings

                Author           Book_Title  User_1  User_2  User_3  User_4
0      Charles Dickens   Great Expectations     3.2     5.0     2.0     4.0
1       John Steinbeck      Of Mice and Men     NaN     1.3     2.3     3.5
2  William Shakespeare     Romeo and Juliet     2.5     4.0     NaN     4.0
3          H. G. Wells     The Time Machine     NaN     3.8     4.0     5.0
4        Lewis Carroll  Alice in Wonderland     NaN     NaN     NaN     4.2


In [145]:
# Fill each rating column with its own mean. Calling fillna on the whole DataFrame
# avoids in-place filling on a selected column, which newer Pandas versions warn about.
book_ratings.fillna(book_ratings.mean(numeric_only=True), inplace=True)

In [146]:
book_ratings

                Author           Book_Title  User_1  User_2  User_3  User_4
0      Charles Dickens   Great Expectations     3.2     5.0     2.0     4.0
1       John Steinbeck      Of Mice and Men     2.9     1.3     2.3     3.5
2  William Shakespeare     Romeo and Juliet     2.5     4.0     2.8     4.0
3          H. G. Wells     The Time Machine     2.9     3.8     4.0     5.0
4        Lewis Carroll  Alice in Wonderland     2.9     3.5     2.8     4.2


In [147]:
book_ratings.rename(columns={'Book_Title':'Book Title','User_1':'User 1','User_2':'User 2','User_3':'User 3','User_4':'User 4'},inplace=True)

In [148]:
book_ratings

                Author           Book Title  User 1  User 2  User 3  User 4
0      Charles Dickens   Great Expectations     3.2     5.0     2.0     4.0
1       John Steinbeck      Of Mice and Men     2.9     1.3     2.3     3.5
2  William Shakespeare     Romeo and Juliet     2.5     4.0     2.8     4.0
3          H. G. Wells     The Time Machine     2.9     3.8     4.0     5.0
4        Lewis Carroll  Alice in Wonderland     2.9     3.5     2.8     4.2


In [151]:
best_rated = book_ratings[(book_ratings == 5).any(axis = 1)]['Book Title'].values
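
The last cell stores its result without displaying it. To make the steps of the Boolean selection visible, the expression can be split up. The sketch below uses a small hypothetical `ratings` table rather than the full `book_ratings` DataFrame, and it compares only the rating columns (an assumption made here to keep the comparison between numbers).

```python
import pandas as pd

ratings = pd.DataFrame({'Book Title': ['Great Expectations', 'Of Mice and Men',
                                       'The Time Machine'],
                        'User 2': [5.0, 1.3, 3.8],
                        'User 4': [4.0, 3.5, 5.0]})

rating_columns = ['User 2', 'User 4']

# Elementwise comparison: True wherever a rating equals 5
is_five = ratings[rating_columns] == 5

# True for every row (book) that has at least one rating of 5
has_five = is_five.any(axis=1)

# Keep the titles of those rows as a NumPy array
best_rated = ratings.loc[has_five, 'Book Title'].values
print(best_rated)
```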