# Pandas
Pandas is a Python library for data analysis. It provides data structures such as `Series` and `DataFrame`, along with many functions for working with them. This notebook was created while following the [pandas Foundations](https://www.datacamp.com/courses/pandas-foundations) course on DataCamp.

Since the course did not provide its dataset, I used a different one. The Pokemon dataset was downloaded from [here](https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6).

In [1]:
# import pandas and numpy
import pandas as pd
import numpy as np

Read a dataset in csv format and inspect the content.

In [2]:
pokemons = pd.read_csv('datasets/pokemon.csv', index_col=0)
pokemons

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
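
To see what `index_col=0` does without needing the file on disk, here is a minimal sketch that reads a few rows from an in-memory string. The `csv_text` excerpt and the name `sample` are made up for illustration; only the column names and values are copied from the output above.

```python
import io
import pandas as pd

# Hypothetical four-row excerpt standing in for datasets/pokemon.csv
csv_text = """#,Name,Type 1,Type 2,Total
1,Bulbasaur,Grass,Poison,318
3,Venusaur,Grass,Poison,525
3,VenusaurMega Venusaur,Grass,Poison,625
4,Charmander,Fire,,309
"""

# index_col=0 turns the first column ('#') into the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.index.name)       # name of the index, taken from the header
print(sample.index.is_unique)  # label 3 appears twice, so this is False
print(sample.shape)            # the index is not counted as a column
```

Note that the index built this way is not unique: Mega evolutions share a number with their base form, which matters later when selecting rows by label.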


As we can see from the result above, the dataset is far too big to go through manually. Pandas provides many functions and attributes that make it easy to inspect the data.

In [3]:
print(type(pokemons))  # check the type of the data that we just read
print(pokemons.shape)  # check the shape/size of the DataFrame

print(pokemons.columns)  # show the names of the columns
print(type(pokemons.columns))  # check the type of column headers

pokemon_names = pokemons['Name']  # extract a single column from the DataFrame
print(type(pokemon_names))  # check the type of a single column

<class 'pandas.core.frame.DataFrame'>
(800, 12)
Index(['Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')
<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.series.Series'>


We still want to see the actual data that we are dealing with. `head` and `tail` are two convenient methods that let us peek at the first and last rows of the data.

In [4]:
pokemons.head()  # default is 5 rows, but we can also specify the number of rows to display

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [5]:
pokemons.tail(3)  # tail lets you look at the last few rows

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
721,Volcanion,Fire,Water,600,80,110,120,130,90,70,6,True


In [6]:
pokemons.info()  # display a useful summary of the data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 1 to 721
Data columns (total 12 columns):
Name          800 non-null object
Type 1        800 non-null object
Type 2        414 non-null object
Total         800 non-null int64
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Sp. Atk       800 non-null int64
Sp. Def       800 non-null int64
Speed         800 non-null int64
Generation    800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 75.8+ KB
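
The summary above shows only 414 non-null values for `Type 2`, so 800 - 414 = 386 rows have no second type. A small sketch of how the same counts could be computed directly; the `sample` frame is a hypothetical miniature of the table, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the Pokemon table; None marks a missing 'Type 2'
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Charizard', 'Squirtle'],
    'Type 1': ['Grass', 'Fire', 'Fire', 'Water'],
    'Type 2': ['Poison', None, 'Flying', None],
})

missing_per_column = sample.isna().sum()  # number of missing values per column
non_null_per_column = sample.count()      # number of non-null values per column

print(missing_per_column)
print(non_null_per_column)
```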


Beyond normal Python-style slicing, pandas provides other ways to select data in a DataFrame: `iloc` (by integer position) and `loc` (by label). The older `ix` indexer was deprecated and then removed in pandas 1.0, so it is not used here. You can read more about advanced data selection in pandas [here](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/).

In [7]:
# iloc: integer-location based indexing
pokemons.iloc[:, -2] = 0   # [row, col] syntax: set the second-to-last column (Generation) to 0 through broadcasting
pokemons

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,0,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,0,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,0,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,0,False
4,Charmander,Fire,,309,39,52,43,60,50,65,0,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,0,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,0,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,0,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,0,False
7,Squirtle,Water,,314,44,48,65,50,64,43,0,False
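
The cell above only uses `iloc` for assignment. As a rough sketch of how `iloc` and `loc` differ when reading data, the example below uses a hypothetical four-row excerpt (`sample`) whose index repeats the label 3, as the real dataset does.

```python
import pandas as pd

# Hypothetical excerpt; the index repeats label 3, as the real data does
sample = pd.DataFrame(
    {'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'VenusaurMega Venusaur'],
     'Total': [318, 405, 525, 625]},
    index=pd.Index([1, 2, 3, 3], name='#'),
)

by_position = sample.iloc[2]            # third row, counting from 0 (a Series)
by_label = sample.loc[3]                # every row labelled 3 (a DataFrame)
one_cell = sample.iloc[0, 1]            # row 0, column 1 ('Total')
label_slice = sample.loc[1:2, 'Name']   # label slices include both endpoints

print(by_position['Name'])
print(len(by_label))
print(one_cell)
print(list(label_slice))
```

The main thing to watch is that `loc` works with index labels, so a repeated label returns several rows, while `iloc` always counts positions from 0.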


In [8]:
np_vals = pokemons.values  # convert the DataFrame to a NumPy array (newer pandas recommends pokemons.to_numpy())
np_vals

array([['Bulbasaur', 'Grass', 'Poison', ..., 45, 0, False],
       ['Ivysaur', 'Grass', 'Poison', ..., 60, 0, False],
       ['Venusaur', 'Grass', 'Poison', ..., 80, 0, False],
       ...,
       ['HoopaHoopa Confined', 'Psychic', 'Ghost', ..., 70, 0, True],
       ['HoopaHoopa Unbound', 'Psychic', 'Dark', ..., 80, 0, True],
       ['Volcanion', 'Fire', 'Water', ..., 70, 0, True]], dtype=object)
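
The array above has `dtype=object` because the DataFrame mixes strings, integers, and booleans. A minimal sketch, using a made-up two-row frame, of how selecting only numeric columns before converting gives a numeric array that NumPy can compute on:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one text column and two integer columns
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur'],
    'HP': [45, 60],
    'Attack': [49, 62],
})

mixed = sample.to_numpy()                      # strings + ints -> dtype object
numeric = sample[['HP', 'Attack']].to_numpy()  # all ints -> an integer dtype

print(mixed.dtype)
print(numeric.dtype)
print(numeric.mean(axis=0))  # column means, computed by NumPy
```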

We can also build a DataFrame from scratch.

In [9]:
# build a DataFrame using Python dictionary
my_data_1 = {'weekdays': ['Sun', 'Sun', 'Mon', 'Mon'],
        'cities': ['Austin', 'Dallas', 'Austin', 'Dallas'],
        'visitors': [139, 237, 326, 456],
        'signups': [7, 12, 3, 5]}
users_1 = pd.DataFrame(my_data_1)
print(users_1)

  weekdays  cities  visitors  signups
0      Sun  Austin       139        7
1      Sun  Dallas       237       12
2      Mon  Austin       326        3
3      Mon  Dallas       456        5


In [10]:
# another way to build a DataFrame: zip the column labels with the column data, then convert to a dict
list_labels = ['weekdays', 'cities', 'visitors', 'signups']
weekdays = ['Sun', 'Sun', 'Mon', 'Mon']
cities = ['Austin', 'Dallas', 'Austin', 'Dallas']
visitors = [139, 237, 326, 456]
signups = [7, 12, 3, 5]
list_cols = [weekdays, cities, visitors, signups]
zipped = list(zip(list_labels, list_cols))
my_data_2 = dict(zipped)
users_2 = pd.DataFrame(my_data_2)
print(users_2)

  weekdays  cities  visitors  signups
0      Sun  Austin       139        7
1      Sun  Dallas       237       12
2      Mon  Austin       326        3
3      Mon  Dallas       456        5
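
To make the `zip` step less opaque, here is a small sketch with two made-up columns. It prints the first (label, column) pair and checks that the zipped version matches a DataFrame built from a dictionary literal; all names in it are illustrative.

```python
import pandas as pd

labels = ['weekdays', 'visitors']
columns = [['Sun', 'Mon'], [139, 326]]

# zip pairs each label with its column; dict turns the pairs into a mapping
paired = list(zip(labels, columns))
from_zip = pd.DataFrame(dict(paired))

from_literal = pd.DataFrame({'weekdays': ['Sun', 'Mon'],
                             'visitors': [139, 326]})

print(paired[0])
print(from_zip.equals(from_literal))  # True when values, dtypes, and labels match
```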


In [11]:
# broadcasting
users_1['fees'] = 0
print(users_1)

  weekdays  cities  visitors  signups  fees
0      Sun  Austin       139        7     0
1      Sun  Dallas       237       12     0
2      Mon  Austin       326        3     0
3      Mon  Dallas       456        5     0


In [12]:
heights = [59.0, 65.2, 62.9, 65.4, 63.7, 65.7, 64.1]
# broadcasting inside the constructor: the scalar 'M' is repeated for every row
my_data_3 = {'height': heights, 'sex': 'M'}
results = pd.DataFrame(my_data_3)
print(results)

   height sex
0    59.0   M
1    65.2   M
2    62.9   M
3    65.4   M
4    63.7   M
5    65.7   M
6    64.1   M
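
Broadcasting applies to scalars only. As a sketch of the boundary, the example below (with a hypothetical `users` frame) assigns a scalar successfully and then shows that a list of the wrong length is rejected with a `ValueError` instead of being stretched to fit.

```python
import pandas as pd

# Hypothetical three-row frame
users = pd.DataFrame({'cities': ['Austin', 'Dallas', 'Austin'],
                      'signups': [7, 12, 3]})

users['fees'] = 0  # a scalar is repeated for every row
print(users['fees'].tolist())

try:
    users['bad'] = [1, 2]  # a list must have one value per row
except ValueError as err:
    print('ValueError:', err)
```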


To save your new DataFrame, pandas provides a number of methods that write the data to different file formats.

In [13]:
users_1.to_csv('new.csv', index=False)  # save the data as a csv file
# users_1.to_excel('new.xlsx', index=False)  # save it as an Excel file (needs an Excel writer package such as openpyxl)
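
To check what `index=False` actually writes, here is a minimal round-trip sketch that saves to an in-memory buffer instead of a file and reads the text back. The `users` frame and the use of `io.StringIO` are stand-ins chosen for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row frame
users = pd.DataFrame({'cities': ['Austin', 'Dallas'],
                      'visitors': [139, 237]})

buffer = io.StringIO()             # in-memory stand-in for a file on disk
users.to_csv(buffer, index=False)  # index=False leaves out the 0, 1, ... row labels
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back should reproduce the original frame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(users))
```

Without `index=False`, the row labels would be written as an extra unnamed first column, and reading the file back would need `index_col=0` to recover the same shape.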