# What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.



# Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

# What Can Pandas Do?
Pandas gives you answers about the data, such as:

* Is there a correlation between two or more columns?
* What is the average value?
* What is the maximum value?
* What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
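The questions above can be sketched in a few lines. This is a minimal illustration only: the column names and values are made up, and `corr()`, `mean()`, `max()`, `min()` and `dropna()` are just one way to get each answer.

```python
import pandas as pd

# Hypothetical sample data; names and values are invented for illustration.
df = pd.DataFrame({
    "duration": [60, 45, 30, None],
    "calories": [400.0, 300.0, 200.0, 250.0],
})

# Pearson correlation between two columns (rows with missing values are skipped)
correlation = df["duration"].corr(df["calories"])

average = df["calories"].mean()   # average value
highest = df["calories"].max()    # max value
lowest = df["calories"].min()     # min value

# Cleaning: drop every row that contains an empty (NaN) value
cleaned = df.dropna()

print(correlation, average, highest, lowest)
print(cleaned)
```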



In [3]:
import pandas as pd

In [6]:
my_dataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}

myvar = pd.DataFrame(my_dataset)
myvar

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


## Checking pandas version
The version string is stored in the `__version__` attribute.

In [8]:
import pandas as pd
version = pd.__version__
print(version)

2.2.2


# What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

### Example
Create a simple Pandas Series from a list:

In [12]:
import pandas as pd

a = [1, 7, 2]

my_serie = pd.Series(a)

my_serie

0    1
1    7
2    2
dtype: int64

## Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

### Example
Return the first value of the Series:

In [13]:
my_serie[0]

1

# Create Labels
With the index argument, you can name your own labels.

### Example
Create your own labels:

In [14]:
import pandas as pd

a = [1, 7, 2]

my_serie = pd.Series(a, index=['x', 'y', 'z'])

my_serie

x    1
y    7
z    2
dtype: int64

When you have created labels, you can access an item by referring to the label.

In [15]:
print(my_serie["z"])

2
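Once a Series has custom labels, it can help to be explicit about whether you mean a label or a position. The sketch below uses the `loc` and `iloc` accessors for that; it rebuilds the same Series so that it runs on its own.

```python
import pandas as pd

my_serie = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = my_serie.loc["z"]     # look up by label
by_position = my_serie.iloc[2]   # look up by integer position (third item)

print(by_label, by_position)
```

Both lookups return the same item here, because "z" is the label of the third value.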


## Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

### Example
Create a simple Pandas Series from a dictionary:

In [16]:
import pandas as pd

calories = {
    'day1': 420,
    'day2': 380, 
    'day3': 390
}

my_serie = pd.Series(calories)
my_serie

day1    420
day2    380
day3    390
dtype: int64

* Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

#### Example
Create a Series using only data from "day1" and "day3":


In [18]:
import pandas as pd

calories = {
    'day1': 420,
    'day2': 380,
    'day3': 390
}

my_serie = pd.Series(calories, index=["day1", 'day3'])
my_serie

day1    420
day3    390
dtype: int64
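One thing worth knowing about the index argument: a label that is not a key in the dictionary does not raise an error. The sketch below assumes a made-up label "day4" to show what happens instead.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that does not exist in the dictionary.
my_serie = pd.Series(calories, index=["day1", "day4"])

# The missing label gets NaN, and the values become floats to make room for it.
print(my_serie)
```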

## DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

A Series is like a column; a DataFrame is the whole table.

### Example
Create a DataFrame from a dictionary of two lists:

In [19]:
import pandas as pd

data = {
    'calories': [420, 380, 390],
    'duration': [50, 40, 45]
}

my_df = pd.DataFrame(data)
my_df

   calories  duration
0       420        50
1       380        40
2       390        45
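To make the "a Series is like a column" idea concrete, here is a small sketch that builds the same table from two actual Series objects and then pulls one column back out. The variable names are chosen for illustration.

```python
import pandas as pd

calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each Series becomes one column of the DataFrame
my_df = pd.DataFrame({"calories": calories, "duration": duration})

# Selecting a single column gives back a Series
column = my_df["calories"]

print(my_df)
print(type(column).__name__)
```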


## Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

### Example
Return row 0:

In [20]:
# refer to the row index:
print(my_df.loc[0])

# Note: This example returns a Pandas Series.

calories    420
duration     50
Name: 0, dtype: int64


### Example
Return row 0 and 1:

In [22]:
print(my_df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40


Note: When a list of labels is passed inside [], the result is a Pandas DataFrame.
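The difference between the two return types can be checked directly. This sketch rebuilds the small DataFrame from the earlier example so that it runs on its own.

```python
import pandas as pd

my_df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = my_df.loc[0]           # single label -> Series
two_rows = my_df.loc[[0, 1]]     # list of labels -> DataFrame
still_a_frame = my_df.loc[[0]]   # one-item list -> DataFrame with one row

print(type(one_row).__name__)
print(type(two_rows).__name__)
print(type(still_a_frame).__name__)
```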

## Named Indexes
With the index argument, you can name your own indexes.

### Example
Add a list of names to give each row a name:

In [23]:
import pandas as pd

data = {
    'calories': [420, 380, 390],
    'duration': [50, 40, 45]
}

df = pd.DataFrame(data, index= ["day1", "day2", "day3"])

df

      calories  duration
day1       420        50
day2       380        40
day3       390        45


## Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).

### Example
Return "day2":

In [27]:
print(df.loc[["day2"]])

      calories  duration
day2       380        40
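Named indexes also work with slices. A short sketch, assuming the same three-day DataFrame as above; note that, unlike ordinary Python slices, a label slice with loc includes both end points.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Label slice: "day1" and "day2" are both included in the result
subset = df.loc["day1":"day2"]

print(subset)
```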


# Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.

### Example
Load a comma separated file (CSV file) into a DataFrame:

In [29]:
import pandas as pd

df = pd.read_csv("./Dataset/pokemon_data.csv")

df

       #                   Name   Type 1  Type 2   HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0      1              Bulbasaur    Grass  Poison   45      49       49       65       65     45           1      False
1      2                Ivysaur    Grass  Poison   60      62       63       80       80     60           1      False
2      3               Venusaur    Grass  Poison   80      82       83      100      100     80           1      False
3      3  VenusaurMega Venusaur    Grass  Poison   80     100      123      122      120     80           1      False
4      4             Charmander     Fire     NaN   39      52       43       60       50     65           1      False
...  ...                    ...      ...     ...  ...     ...      ...      ...      ...    ...         ...        ...
795  719                Diancie     Rock   Fairy   50     100      150      100      150     50           6       True
796  719    DiancieMega Diancie     Rock   Fairy   50     160      110      160      110    110           6       True
797  720    HoopaHoopa Confined  Psychic   Ghost   80     110       60      150      130     70           6       True
798  720     HoopaHoopa Unbound  Psychic    Dark   80     160       60      170      130     80           6       True


In [39]:
# select the rows labeled 0 through 6 (seven rows in total)
first_seven_rows = df.loc[[0, 1, 2, 3, 4, 5, 6]]
first_seven_rows

   #                   Name  Type 1  Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1              Bulbasaur   Grass  Poison  45      49       49       65       65     45           1      False
1  2                Ivysaur   Grass  Poison  60      62       63       80       80     60           1      False
2  3               Venusaur   Grass  Poison  80      82       83      100      100     80           1      False
3  3  VenusaurMega Venusaur   Grass  Poison  80     100      123      122      120     80           1      False
4  4             Charmander    Fire     NaN  39      52       43       60       50     65           1      False
5  5             Charmeleon    Fire     NaN  58      64       58       80       65     80           1      False
6  6              Charizard    Fire  Flying  78      84       78      109       85    100           1      False
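Listing every row label by hand gets tedious quickly. As a sketch of two shorter alternatives, the example below uses a small made-up DataFrame as a stand-in for a loaded data set and compares the label list with head() and a positional iloc slice.

```python
import pandas as pd

# Hypothetical stand-in for a loaded data set
df = pd.DataFrame({"value": range(10)})

by_labels = df.loc[[0, 1, 2, 3, 4, 5, 6]]   # explicit list of row labels
by_head = df.head(7)                        # first seven rows
by_position = df.iloc[:7]                   # rows at positions 0 through 6

# All three selections contain the same values
print(by_labels["value"].tolist() == by_head["value"].tolist())
print(by_labels["value"].tolist() == by_position["value"].tolist())
```

With the default 0, 1, 2, ... index the three forms select the same rows; with a custom or reordered index, loc works on labels while head() and iloc work on position.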


# Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).

CSV files contain plain text and are a well-known format that can be read by almost any tool, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

In [43]:
import pandas as pd

df = pd.read_csv('./Dataset/data.csv')

df

# By default, when you print a large DataFrame like this one,
# you will only get the first 5 rows and the last 5 rows:

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
...       ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
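If you do not have the CSV file at hand, read_csv can also read from a file-like object. The sketch below uses a short made-up CSV text wrapped in io.StringIO as a stand-in for the file contents, and shows that an empty field is loaded as NaN.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the contents of a file such as 'data.csv'.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,90,112,
"""

# read_csv accepts file-like objects as well as file paths
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```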


Tip: use to_string() to print the entire DataFrame.

In [44]:
import pandas as pd

data = pd.read_csv("./Dataset/data.csv")

print(data.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   
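Whether the printed output is shortened depends on a display setting. The sketch below assumes a made-up 100-row DataFrame and compares the default printed form with to_string(); pd.options.display.max_rows is the row limit above which Pandas shortens the output (60 unless it has been changed).

```python
import pandas as pd

# Hypothetical 100-row DataFrame, large enough to be shortened when printed
df = pd.DataFrame({"n": range(100)})

print(pd.options.display.max_rows)   # current row limit for printed output

short = str(df)          # shortened: first and last rows only
full = df.to_string()    # every row

print(len(short.splitlines()), len(full.splitlines()))
```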

# Read JSON
Big data sets are often stored or extracted as JSON.

JSON is plain text, but has the format of an object, and is widely supported in the world of programming, including by Pandas.

In our examples we will be using a JSON file called 'data.json'.

In [47]:
import pandas as pd

df = pd.read_json('file:///D:/www.w3schools.com/python/pandas/data.js')

print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

## Dictionary as JSON
JSON objects have nearly the same format as Python dictionaries.

If your JSON data is not in a file, but already in a Python dictionary, you can load it into a DataFrame directly:

#### Example
Load a Python Dictionary into a DataFrame:

In [48]:
import pandas as pd

data = {
    'Duration': {
        "0": 60,
        "1": 60,
        "2": 60,
        "3": 45,
        "4": 45,
        "5": 60
    },
    'Pulse': {
        "0": 110,
        "1": 117,
        "2": 103,
        "3": 109,
        "4": 117,
        "5": 102
    },
    'Maxpulse': {
        "0": 130,
        "1": 145,
        "2": 135,
        "3": 175,
        "4": 148,
        "5": 127
    },
    'Calories': {
        "0": 409,
        "1": 479,
        "2": 340,
        "3": 282,
        "4": 406,
        "5": 300
    }
}

df = pd.DataFrame(data)

df

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
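When the JSON arrives as text rather than as a dictionary (for example from a web response), one possible route is to parse it with the standard json module first. The JSON text below is made up for illustration.

```python
import json
import pandas as pd

# Hypothetical JSON text; in practice it might come from a file or a web API.
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

data = json.loads(json_text)   # JSON text -> Python dictionary
df = pd.DataFrame(data)        # dictionary -> DataFrame

print(df)
```

The outer keys become the column names and the inner keys become the row labels, which stay strings ("0", "1") in this form.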


# Pandas - Analyzing DataFrames

## Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

### Example
Get a quick overview by printing the first 10 rows of the DataFrame:

In [49]:
import pandas as pd

df = pd.read_csv("./Dataset/data.csv")

df.head(10)

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the bottom.

### Example
Print the last 5 rows of the DataFrame:

In [50]:
import pandas as pd

df = pd.read_csv("./Dataset/data.csv")

df.tail()

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
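Both methods return 5 rows when no number is given, which is why tail() above printed five rows. A quick sketch with a made-up 20-row DataFrame:

```python
import pandas as pd

# Hypothetical 20-row DataFrame
df = pd.DataFrame({"n": range(20)})

default_head = df.head()    # no argument: first 5 rows
default_tail = df.tail()    # no argument: last 5 rows
last_three = df.tail(3)     # explicit number of rows

print(len(default_head), len(default_tail), len(last_three))
print(last_three["n"].tolist())
```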


# Info About the Data
The DataFrame object has a method called info() that gives you more information about the data set.

### Example
Print information about the data:

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
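In the output above, the Calories column has 164 non-null values out of 169 entries, so 5 rows have no value there. If you want those counts as numbers rather than as printed text, one option is sketched below with a small made-up DataFrame.

```python
import pandas as pd

# Hypothetical data with one empty Calories value
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, None, 195.1],
})

missing_per_column = df.isna().sum()   # number of empty cells in each column
non_null_per_column = df.count()       # the same counts info() lists as "Non-Null Count"

print(missing_per_column)
print(non_null_per_column)
```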
