# Pandas

## Pandas basics
Pandas is a Python library for working with data sets. It provides functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Pandas lets us analyze large data sets and draw conclusions based on statistical theory. It can also clean messy data sets and make them readable and relevant, which matters because relevant data is essential in data science.

Create an alias with the `as` keyword while importing:

In [1]:
# import pandas
import pandas as pd
# import numpy
import numpy as np

Now the Pandas package can be referred to as `pd` instead of `pandas`.

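## Series
A Pandas Series is a one-dimensional labeled array, like a single column in a table. When no index is given, the values are labeled with integers starting at 0.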

In [2]:
name = ['Andy', 'Ron', 'James', 'Edy']
myseries1 = pd.Series(name)

In [3]:
print(myseries1)

0     Andy
1      Ron
2    James
3      Edy
dtype: object


In [4]:
age = [21, 35, 54, 33]
myseries2 = pd.Series(age)

In [5]:
print(myseries2)

0    21
1    35
2    54
3    33
dtype: int64
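
The two Series above use the default integer labels. As a minimal sketch, the same ages can be given custom labels through the `index` argument. Pairing the names with the ages this way is an assumption made only for illustration.

```python
import pandas as pd

# Assumption: each age belongs to the name at the same position.
ages = pd.Series([21, 35, 54, 33], index=["Andy", "Ron", "James", "Edy"])

print(ages["Ron"])    # look up a value by its label -> 35
print(ages.iloc[0])   # look up a value by its position -> 21
print(ages.mean())    # Series support vectorized statistics -> 35.75
```

Label-based access tends to be easier to read than positional access once the labels carry meaning.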


## DataFrame
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

Create a simple DataFrame:

In [7]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)

print(df) 

   calories  duration
0       420        50
1       380        40
2       390        45
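
One way to relate the DataFrame to the Series shown earlier is to look at a single column. The sketch below rebuilds the same `data` and shows that selecting one column returns a Series.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

calories = df["calories"]        # selecting one column returns a Series
print(type(calories).__name__)   # Series
print(df.columns.tolist())       # ['calories', 'duration']
print(df.dtypes)                 # one dtype per column
```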


**Locate Row**: As the result above shows, a DataFrame is like a table with rows and columns.
Pandas uses the `loc` attribute to return one or more specified rows by label. A single label returns the row as a Series, while a list of labels returns a DataFrame.

In [8]:
# refer to the row index:
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


In [9]:
# use a list of indexes:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40
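
The two `loc` calls above return different types, and `loc` also accepts a column label. The sketch below uses the same data. The variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

row = df.loc[0]                      # single label -> Series
rows = df.loc[[0, 1]]                # list of labels -> DataFrame
value = df.loc[1, "duration"]        # row label and column label -> scalar
subset = df.loc[0:1, ["calories"]]   # label slices include both endpoints

print(type(row).__name__, type(rows).__name__)  # Series DataFrame
print(value)                                     # 40
print(subset.shape)                              # (2, 1)
```

Note that `0:1` selects two rows here, unlike a Python list slice, because `loc` slices by label and includes the end label.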


**Named Indexes**: With the `index` argument, you can assign your own row labels instead of the default integers.

In [10]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 

      calories  duration
day1       420        50
day2       380        40
day3       390        45


**Locate Named Indexes**:
Pass a named index to the `loc` attribute to return the specified row(s).

In [11]:
# refer to the named index:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
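
Once rows have names, it helps to separate label-based lookup (`loc`) from position-based lookup (`iloc`). The following sketch, using the same named DataFrame, shows both reaching the same row and what happens when a label is missing.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

by_label = df.loc["day2"]     # label-based lookup
by_position = df.iloc[1]      # position-based lookup of the same row
print(by_label.equals(by_position))  # True

# A label that is not in the index raises KeyError.
try:
    df.loc["day4"]
except KeyError:
    print("day4 is not in the index")
```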


**Read CSV Files**:
A simple way to store large data sets is to use CSV (comma-separated values) files.
CSV files contain plain text in a well-known format that most tools, including Pandas, can read.

In [14]:
import pandas as pd

df = pd.read_csv('data/gdp.csv')
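
To try `read_csv` without a file on disk, the text can be wrapped in `io.StringIO`. This is a minimal sketch. The CSV content and the country names are hypothetical and only mimic the layout of a table with one column per year.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Country Name,Country Code,2021,2022
Aland,ALD,100.0,110.0
Borduria,BRD,250.5,
"""

df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small)
print(df_small.dtypes)           # the year columns are parsed as float64
print(df_small["2022"].isna())   # the empty field becomes NaN
```

Empty fields are read as `NaN`, which is why missing values can appear in numeric columns after loading.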

In [16]:
df.head()  # first five rows

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Aruba,ABW,,,,,,,,,...,2727850000.0,2790850000.0,2962907000.0,2983635000.0,3092429000.0,3276184000.0,3395799000.0,2558906000.0,3103184000.0,3544708000.0
1,Africa Eastern and Southern,AFE,18478100000.0,19366310000.0,20506470000.0,22242730000.0,24294330000.0,26619560000.0,28732790000.0,31592960000.0,...,986343000000.0,1006990000000.0,932513000000.0,890051000000.0,1028390000000.0,1012520000000.0,1006190000000.0,928880000000.0,1086530000000.0,1185140000000.0
2,Afghanistan,AFG,537777800.0,548888900.0,546666700.0,751111200.0,800000000.0,1006667000.0,1400000000.0,1673333000.0,...,20146420000.0,20497130000.0,19134220000.0,18116570000.0,18753460000.0,18053220000.0,18799440000.0,19955930000.0,14266500000.0,
3,Africa Western and Central,AFW,10411650000.0,11135920000.0,11951710000.0,12685810000.0,13849000000.0,14874760000.0,15845580000.0,14428490000.0,...,834097000000.0,894505000000.0,769263000000.0,692115000000.0,685630000000.0,768158000000.0,823406000000.0,786962000000.0,844928000000.0,875394000000.0
4,Angola,AGO,,,,,,,,,...,132339000000.0,135967000000.0,90496420000.0,52761620000.0,73690160000.0,79450690000.0,70897960000.0,48501560000.0,66505130000.0,106783000000.0
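
The `Unnamed: 0` column in the output above typically appears when a CSV was written together with its row index, which leaves the first header field empty. Whether `data/gdp.csv` was produced that way is an assumption. The sketch below reproduces the effect with hypothetical data and shows that `index_col=0` reads that column back as the index.

```python
import io
import pandas as pd

# Hypothetical data; to_csv() writes the row index by default,
# which adds a first column with an empty header.
original = pd.DataFrame({"Country Code": ["AAA", "BBB"], "2022": [1.0, 2.0]})
csv_text = original.to_csv()

default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.columns.tolist())   # ['Unnamed: 0', 'Country Code', '2022']

indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_read.columns.tolist())   # ['Country Code', '2022']
```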


In [17]:
df.tail() # last five rows

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
261,Kosovo,XKX,,,,,,,,,...,6735329000.0,7074395000.0,6295848000.0,6682677000.0,7180765000.0,7878760000.0,7899738000.0,7717145000.0,9412034000.0,9409474000.0
262,"Yemen, Rep.",YEM,,,,,,,,,...,40415230000.0,43228590000.0,42444490000.0,31317820000.0,26842230000.0,21606160000.0,,,,
263,South Africa,ZAF,8748597000.0,9225996000.0,9813996000.0,10854200000.0,11956000000.0,13068990000.0,14211390000.0,15821390000.0,...,400886000000.0,381199000000.0,346710000000.0,323586000000.0,381449000000.0,405261000000.0,389330000000.0,338291000000.0,420118000000.0,405271000000.0
264,Zambia,ZMB,713000000.0,696285700.0,693142900.0,718714300.0,839428600.0,1082857000.0,1264286000.0,1368000000.0,...,28037240000.0,27141020000.0,21251220000.0,20958410000.0,25873600000.0,26311510000.0,23308670000.0,18110640000.0,22096420000.0,29163780000.0
265,Zimbabwe,ZWE,1052990000.0,1096647000.0,1117602000.0,1159512000.0,1217138000.0,1311436000.0,1281750000.0,1397002000.0,...,19091020000.0,19495520000.0,19963120000.0,20548680000.0,17584890000.0,34156070000.0,21832230000.0,21509700000.0,28371240000.0,27366630000.0


In [18]:
df.sample(6)  # six randomly selected rows

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
233,Thailand,THA,2760751000.0,3034038000.0,3308913000.0,3540403000.0,3889130000.0,4388938000.0,5279231000.0,5638461000.0,...,420334000000.0,407339000000.0,401296000000.0,413366000000.0,456357000000.0,506754000000.0,543977000000.0,500457000000.0,505568000000.0,495423000000.0
211,El Salvador,SLV,,,,,,877720000.0,929520000.0,976200000.0,...,21990960000.0,22593470000.0,23438240000.0,24191430000.0,24979190000.0,26020850000.0,26881140000.0,24930080000.0,29451240000.0,32488720000.0
251,United States,USA,543300000000.0,563300000000.0,605100000000.0,638600000000.0,685800000000.0,743700000000.0,815000000000.0,861700000000.0,...,16843200000000.0,17550700000000.0,18206000000000.0,18695100000000.0,19477300000000.0,20533100000000.0,21381000000000.0,21060500000000.0,23315100000000.0,25439700000000.0
15,Azerbaijan,AZE,,,,,,,,,...,74160560000.0,75239790000.0,53076240000.0,37867000000.0,40866630000.0,47112470000.0,48174240000.0,42693000000.0,54825410000.0,78721060000.0
134,Latin America & Caribbean,LCN,,,,,,,,,...,6360880000000.0,6482210000000.0,5420470000000.0,5285690000000.0,5866300000000.0,5742580000000.0,5660900000000.0,4809520000000.0,5553340000000.0,6302490000000.0
244,Turkiye,TUR,7566667000.0,7988889000.0,8922222000.0,10355560000.0,11177780000.0,11966670000.0,14100000000.0,15644440000.0,...,957799000000.0,938935000000.0,864314000000.0,869683000000.0,858988000000.0,778972000000.0,761006000000.0,720338000000.0,819865000000.0,907118000000.0


In [19]:
df.shape  # (number of rows, number of columns)

(266, 65)
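
The inspection methods above also accept a row count, and `sample` can be made repeatable. The sketch below uses a small hypothetical DataFrame so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical ten-row frame used only to show the inspection methods.
df_demo = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

print(df_demo.head(3))                     # first three rows
print(df_demo.tail(2))                     # last two rows
print(df_demo.sample(4, random_state=0))   # four random rows, repeatable via random_state
print(df_demo.shape)                       # (10, 2)

# shape is a plain tuple, so it can be unpacked.
n_rows, n_cols = df_demo.shape
```

Passing `random_state` is optional. Without it, each call to `sample` may return different rows.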