# Pandas
1. Pandas
2. Data structures
3. Pandas Series
    - Creating Series
    - Manipulating Series
4. Pandas Dataframes
    - Creating Dataframes
    - Manipulating Dataframes
5. Reading Data from different Sources

## 1. Pandas
- It contains data structures and manipulation tools designed for data cleaning and analysis.
- While it adopts many idioms from _NumPy_, the biggest difference is that Pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
- Its name is derived from "panel data", an econometrics term for multidimensional structured data sets.
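
The contrast between the two libraries can be sketched with a few lines. The values and column names below are invented for illustration; the point is only that a NumPy array settles on one dtype for everything, while a DataFrame keeps one dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype: mixing numbers and text
# coerces every element to a string.
arr = np.array([1, 2.5, "three"])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [1.5, 2.5, 3.5],
    "label": ["a", "b", "c"],
})
print(df.dtypes)
```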

### Pandas installation and import 
- Installation
`!pip install pandas`
- Import 
`import pandas as pd` 

In [3]:
# Import Pandas 
import pandas as pd

## 2. Data Structures
- Pandas has 2 data structures as follows:
1. A __Series__ is a 1-dimensional labeled array that can hold data of any type (integer, float, string, boolean, Python object, and so on). Its axis labels are collectively called an index.
2. A __DataFrame__ is a 2-dimensional labeled data structure with columns. Each column can have a different data type.

## 3. Pandas Series 
- A Series is a one-dimensional labeled array capable of holding any data type. All of its values share a single dtype (mixed values are stored as `object`), so it behaves like a homogeneous sequence, similar to an array, a list, or a column in a table.
- Each item in the Series is assigned an index label. By default, the labels run from 0 to N, where N is the length of the Series minus one.
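
As a small illustration of the default index (the values are arbitrary), a Series built without an `index=` argument gets the labels 0 to N:

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# With no index given, pandas creates a RangeIndex starting at 0.
print(s.index)
print(list(s.index))

# The last label is the length of the Series minus one.
print(s[len(s) - 1])
```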

### 3.1 Creating Series 
1. __To create a numeric series__.

In [4]:
# Create a numeric series  
numbers = range(1,100,5)
pd.Series(numbers)

0      1
1      6
2     11
3     16
4     21
5     26
6     31
7     36
8     41
9     46
10    51
11    56
12    61
13    66
14    71
15    76
16    81
17    86
18    91
19    96
dtype: int64

- The output is of type `int64`
- The row names are usually denoted as _"Index"_
2. __To create an object series__

In [5]:
string = "Hi" , "How " , "are " ,"you","?"
pd.Series(string)

0      Hi
1    How 
2    are 
3     you
4       ?
dtype: object

- Output is of type `object`
3. __To create a series with both numeric and string values__

In [6]:
# Create a Series from an arbitrary list
pd.Series([365,'London',34.5,-34.5,'Happy Birthday'])

0               365
1            London
2              34.5
3             -34.5
4    Happy Birthday
dtype: object

- Here the numeric values are stored as objects. _A Series has a single dtype, so when the values are of mixed types pandas falls back to `object` for all of them._
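
A quick way to see what `object` means here is to look at the type of each element. In this sketch the elements keep their original Python types even though the Series as a whole reports `object`:

```python
import pandas as pd

mixed = pd.Series([365, "London", 34.5, -34.5, "Happy Birthday"])

# The Series-level dtype is object ...
print(mixed.dtype)

# ... but each element is still an int, str, or float.
print(mixed.map(type))
```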
4. __To set index values for a series__

In [7]:
marks = [60,89,74,86,100]

subject = ["Math" , "System design" , "Cloud Computing" , "Data Analysis" , "React"]

marks_series= pd.Series(marks,index =subject)
marks_series


Math                60
System design       89
Cloud Computing     74
Data Analysis       86
React              100
dtype: int64

- The index is added using the argument `index=`. The data type of the series continues to be numeric.

5. __To create series from a dictionary__

In [8]:
data = {'React':90,"Node":85,"Flutter":50,"Django":75}
pd.Series(data)

React      90
Node       85
Flutter    50
Django     75
dtype: int64

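- On passing a dict, the dict's keys become the index of the resulting Series. In current pandas versions the keys keep the dict's insertion order, as the output above shows; they are not sorted.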
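
The ordering can be checked directly. This sketch reuses the dictionary from the cell above and adds `sort_index()` for the case where sorted labels are wanted:

```python
import pandas as pd

data = {"React": 90, "Node": 85, "Flutter": 50, "Django": 75}
scores = pd.Series(data)

# The index follows the order in which the keys were inserted.
print(list(scores.index))

# sort_index() returns a new Series ordered by label.
by_label = scores.sort_index()
print(list(by_label.index))
```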

6. __A series with missing values__
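- If we pass an index label that is not a key in the dict, its value will be `NaN`. Because `NaN` is a floating-point value, the dtype of the result becomes `float64`.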

In [9]:
subject = ["Math" , "System design" , "Cloud Computing" , "Data Analysis" , "React"] 

marks_series = pd.Series(data,index=subject)
marks_series

Math                NaN
System design       NaN
Cloud Computing     NaN
Data Analysis       NaN
React              90.0
dtype: float64

In [10]:
# Error: `quantity` has 4 values but `index` has only 3 labels
index=['Apple', 'Banana', 'Orange']
quantity = [34, 20, 30, 40]
# Uncomment to see the ValueError 👇🏿.
# pd.Series(data=quantity, index=index)
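
To inspect the error without stopping the notebook, the call can be wrapped in `try`/`except`. This is a minimal sketch using the same lists as the cell above:

```python
import pandas as pd

index = ["Apple", "Banana", "Orange"]
quantity = [34, 20, 30, 40]

try:
    # 4 values cannot be matched to 3 index labels.
    pd.Series(data=quantity, index=index)
except ValueError as err:
    message = str(err)
    print(message)
```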

In [11]:
# Avoid naming a variable `dict`: it would shadow the built-in type
stock = {'A':30, 'B':40, 'C':50}
index=['A', 'B', 'D']
pd.Series(data=stock, index=index)

A    30.0
B    40.0
D     NaN
dtype: float64

### 3.2 Manipulating Series
1. __To check for non-null values using `.notnull()`__

In [12]:
marks_series.notnull()

Math               False
System design      False
Cloud Computing    False
Data Analysis      False
React               True
dtype: bool

- `True` indicates that the value is not null.
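
Beyond checking for missing values, they usually have to be counted, dropped, or filled. The sketch below uses a shortened, hypothetical version of `marks_series` and three standard Series methods, `isnull()`, `dropna()`, and `fillna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for marks_series with two missing values.
marks = pd.Series([np.nan, np.nan, 90.0],
                  index=["Math", "System design", "React"])

# isnull() is the inverse of notnull(); summing counts the missing values.
missing_count = marks.isnull().sum()
print(missing_count)

# dropna() returns a new Series without the missing entries.
present = marks.dropna()
print(present)

# fillna() returns a new Series with missing entries replaced.
filled = marks.fillna(0)
print(filled)
```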
3. __To know the subjects in which marks score is more than 75__

In [13]:
marks_series[marks_series > 75]

React    90.0
dtype: float64
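
The expression `marks_series > 75` produces a boolean Series, and indexing with it keeps only the `True` positions. Conditions can be combined with `&` (and) and `|` (or), with each condition in parentheses. A sketch using the marks from section 3.1:

```python
import pandas as pd

marks = pd.Series(
    [60, 89, 74, 86, 100],
    index=["Math", "System design", "Cloud Computing", "Data Analysis", "React"],
)

# The comparison yields one True/False per element.
mask = marks > 75
print(mask)

# Combine conditions element-wise; the parentheses are required.
high_not_perfect = marks[(marks > 75) & (marks < 100)]
print(high_not_perfect)
```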

4. __To assign a new value to an existing label (here 91 marks to 'Math')__

In [16]:
marks_series["Math"] = 91
# or
#  mark
marks_series

Math               91.0
System design       NaN
Cloud Computing     NaN
Data Analysis       NaN
React              91.0
dtype: float64

In [17]:
# Compare values
marks_series["Math"] == 75
# OR
marks_series.Math == 75

False

5. __Sorting Numeric Series__

In [18]:
import numpy as np

In [19]:
values = pd.Series([23,np.nan,45,np.nan,56,67,34,23])
values

0    23.0
1     NaN
2    45.0
3     NaN
4    56.0
5    67.0
6    34.0
7    23.0
dtype: float64

In [20]:
# Ascending order (not displayed: a cell shows only its last expression)
values.sort_values(ascending = True)
# Descending order; NaN values are placed last in both cases
values.sort_values(ascending = False)

5    67.0
4    56.0
2    45.0
6    34.0
0    23.0
7    23.0
1     NaN
3     NaN
dtype: float64
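
Two details of `sort_values` are worth a short sketch: it returns a new Series rather than changing the original, and the `na_position` parameter controls where the missing values go.

```python
import numpy as np
import pandas as pd

values = pd.Series([23, np.nan, 45, np.nan, 56, 67, 34, 23])

# Put the missing values first instead of last.
nan_first = values.sort_values(na_position="first")
print(nan_first)

# The original Series keeps its order.
print(values.iloc[0])
```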

6. __Sorting a Series of strings__

In [21]:
# Create a pandas Series of strings
string_values = pd.Series(["a", "f", "j", "d", "c"])
string_values
# Strings are compared in lexicographical order when sorting.
# sort_values keeps each element paired with its original index label.

0    a
1    f
2    j
3    d
4    c
dtype: object

In [22]:
# ascending order
string_values.sort_values(ascending=True)

0    a
4    c
3    d
1    f
2    j
dtype: object

In [23]:
data=range(10)
new_ser=pd.Series(data=data)
new_ser[new_ser==5]

5    5
dtype: int64

In [24]:
marks_series.rank(ascending=True)

Math               1.5
System design      NaN
Cloud Computing    NaN
Data Analysis      NaN
React              1.5
dtype: float64
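
In the output above, Math and React both hold 91, so they tie. By default `rank` gives tied values the average of the ranks they span (here 1 and 2, hence 1.5) and leaves missing values as `NaN`. The sketch below uses a shortened, hypothetical copy of the Series and compares the default with `method="min"`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for marks_series with a tie and a missing value.
marks = pd.Series([91.0, np.nan, 91.0],
                  index=["Math", "System design", "React"])

# Default: ties share the average of their ranks.
average_rank = marks.rank()
print(average_rank)

# method="min": ties share the lowest rank in the group.
min_rank = marks.rank(method="min")
print(min_rank)
```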

## 4. Pandas DataFrames
- A DataFrame is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (numeric, string, boolean, and so on).
- The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index. In a DataFrame, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.
- While a DataFrame is physically two-dimensional, it can be used to represent higher-dimensional data in tabular format using hierarchical indexing.
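
Hierarchical indexing can be sketched with a two-level row index. The years, subjects, and marks below are invented for illustration; the idea is that a third dimension (year) is folded into the row labels of a two-dimensional table.

```python
import pandas as pd

# Two index levels: year and subject (hypothetical data).
index = pd.MultiIndex.from_tuples(
    [("2023", "DSA"), ("2023", "Math"), ("2024", "DSA"), ("2024", "Math")],
    names=["year", "subject"],
)
scores = pd.DataFrame({"marks": [70, 60, 80, 65]}, index=index)
print(scores)

# Selecting on the outer level returns the rows for that year.
year_2024 = scores.loc["2024"]
print(year_2024)

# A full (year, subject) key selects a single cell.
dsa_2024 = scores.loc[("2024", "DSA"), "marks"]
print(dsa_2024)
```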

### 4.1 Creating a DataFrame

1. __Creating a DataFrame from a dictionary__

In [25]:
data ={
    'Subject':["React","Rust","Golang","Elixir"],
    'Marks':(89,34,65,78), 
    'CGPA':[2.5,3.0,4.5,5.6]
}
df = pd.DataFrame(data)
df

|   | Subject | Marks | CGPA |
|---|---------|-------|------|
| 0 | React | 89 | 2.5 |
| 1 | Rust | 34 | 3.0 |
| 2 | Golang | 65 | 4.5 |
| 3 | Elixir | 78 | 5.6 |


__Note__: Like a Series, the resulting DataFrame is assigned an index automatically. The "Marks" values were passed as a tuple, which works the same way as a list.

2. __To create a DataFrame from Series__

In [3]:
Subject = pd.Series(["Math","DSA","Python","ML"])
Marks = pd.Series([23,23,23,23])
CGPA = pd.Series([2.5,3.0,4.0,6.0])

In [4]:
pd.DataFrame([Subject,Marks,CGPA],index = ['Subject','Marks','CGPA'])

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Subject | Math | DSA | Python | ML |
| Marks | 23 | 23 | 23 | 23 |
| CGPA | 2.5 | 3.0 | 4.0 | 6.0 |


- However, if we want a vertical data frame (one column per Series), we use `.T`. The 'T' stands for _transpose_.

In [5]:
pd.DataFrame([Subject,Marks,CGPA],index = ['Subject','Marks','CGPA']).T

|   | Subject | Marks | CGPA |
|---|---------|-------|------|
| 0 | Math | 23 | 2.5 |
| 1 | DSA | 23 | 3.0 |
| 2 | Python | 23 | 4.0 |
| 3 | ML | 23 | 6.0 |
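
One side effect of the transpose approach is that every column ends up with dtype `object`, because each row of the intermediate frame mixed strings and numbers. An alternative sketch, not taken from the notebook above, is to pass a dict of Series so that each column keeps its own dtype:

```python
import pandas as pd

Subject = pd.Series(["Math", "DSA", "Python", "ML"])
Marks = pd.Series([23, 23, 23, 23])
CGPA = pd.Series([2.5, 3.0, 4.0, 6.0])

# Rows of mixed types, then transpose: all columns become object.
transposed = pd.DataFrame([Subject, Marks, CGPA],
                          index=["Subject", "Marks", "CGPA"]).T
print(transposed.dtypes)

# Dict of Series: one column per key, numeric dtypes preserved.
from_dict = pd.DataFrame({"Subject": Subject, "Marks": Marks, "CGPA": CGPA})
print(from_dict.dtypes)
```

With numeric dtypes preserved, operations such as `from_dict["Marks"].mean()` work without a conversion step.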


4. __To create a dataframe from lists__

In [6]:
Subject = ["Math","DSA","Python","ML"]
Marks = [23,23,23,23]
CGPA = [2.5,3.0,4.0,6.0]

pd.DataFrame([Subject,Marks,CGPA],index=['Subject','Marks','CGPA'])

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Subject | Math | DSA | Python | ML |
| Marks | 23 | 23 | 23 | 23 |
| CGPA | 2.5 | 3.0 | 4.0 | 6.0 |


In [7]:
pd.DataFrame([Subject,Marks,CGPA],index=['Subject','Marks','CGPA']).T

|   | Subject | Marks | CGPA |
|---|---------|-------|------|
| 0 | Math | 23 | 2.5 |
| 1 | DSA | 23 | 3.0 |
| 2 | Python | 23 | 4.0 |
| 3 | ML | 23 | 6.0 |


In [18]:
a1 = ['Hogwarts', 'Durmstrang', 'Beauxbatons']
a2 = ['Hogwarts', 'Durmstrang', 'Beauxbatons']
a3 = ['Hogwarts', 'Durmstrang', 'Beauxbatons']
school = [a1, a2, a3]
inst = ['School_1', 'School_2', 'School_3']
Muggle_data = pd.DataFrame(data=school, columns=inst)
Muggle_data

|   | School_1 | School_2 | School_3 |
|---|----------|----------|----------|
| 0 | Hogwarts | Durmstrang | Beauxbatons |
| 1 | Hogwarts | Durmstrang | Beauxbatons |
| 2 | Hogwarts | Durmstrang | Beauxbatons |


5. __To read data from csv file__


In [4]:
data = pd.read_csv("sample.csv")
type(data)
# The file must be in the same directory as the notebook.

pandas.core.frame.DataFrame

- On checking the data type it is a pandas Dataframe.
6. __To print the head of the data__

In [10]:
data.head() # 1st 5 elements

|   | name | world_ranking | region_ranking | country_ranking | region | country | city\state | acceptance_rate | publication | website | phone_no | address |
|---|------|---------------|----------------|-----------------|--------|---------|------------|-----------------|-------------|---------|----------|---------|
| 0 | Aarhus University | #150 of 14,131 | #44 of 2,785 | #2 of 27 | Europe | Italy | Veneto | 15% | 89633 | www.au.dk | +45 8942 1111 | Nordre Ringgade 1\n Aarhus, Central Denmark Re... |
| 1 | Arizona State University - Tempe | #61 of 14,131 | #48 of 2,597 | #45 of 2,496 | Europe | Spain | Valencia | NaN | 99086 | www.asu.edu | 8552785080 | University Drive and Mill Avenue\n Tempe, Ariz... |
| 2 | Auburn University | NaN | NaN | NaN | North America | United States | Wisconsin | 57% | 36231 | auburn.edu | 3348444000 | Samford Hall\n Auburn, Alabama, 36849 \nUnited... |
| 3 | Australian National University | #88 of 14,131 | #5 of 59 | #5 of 40 | North America | Canada | Ontario | 86% | 97754 | www.anu.edu.au | +61 (0)2 6125 5111 | Ellery Crescent, Acton\n Canberra, Australian ... |
| 4 | Autonomous University of Barcelona | NaN | NaN | NaN | Europe | Italy | Emilia-Romagna | 11% | 74922 | www.uab.cat | +34 935812222 | Campus de Bellaterra, Edificio A\n Cerdanyola ... |


- By default, `.head()` displays the first five rows. However, we can pass the desired number of rows to be displayed:
`data.head(9)`

7. __To print tail data__


In [11]:
data.tail() # Last 5

|   | name | world_ranking | region_ranking | country_ranking | region | country | city\state | acceptance_rate | publication | website | phone_no | address |
|---|------|---------------|----------------|-----------------|--------|---------|------------|-----------------|-------------|---------|----------|---------|
| 295 | Xi'an Jiaotong University | NaN | NaN | NaN | North America | United States | Connecticut | 7% | 96097 | www.xjtu.edu.cn | +86 (29) 266 8830 | 28 Xianning Road\n Xi'an, Shaanxi, 710049 \nChina |
| 296 | Yale University | #11 of 14,131 | #9 of 2,597 | #9 of 2,496 | Asia | China | Xi'an | NaN | 198095 | www.yale.edu | 2034324771 | Woodbridge Hall\n New Haven, Connecticut, 0652... |
| 297 | Yonsei University | NaN | NaN | NaN | North America | United States | Washington | 56% | 95497 | www.yonsei.ac.kr | +82 (2) 2123 2114 | 134 Sinchon-dong, Seodaemun-gu\n Seoul, Seoul,... |
| 298 | York University | NaN | NaN | NaN | Oceania | Australia | Canberra | 33% | 41257 | yorku.ca | +1 (416) 736 5002 | 4700 Keele Street\n Toronto, Ontario, M3J 1P3 ... |
| 299 | Zhejiang University | #109 of 14,131 | #7 of 5,830 | #3 of 960 | North America | United States | Alabama | 52% | 176136 | www.zju.edu.cn | +86 (571) 8795 1020 | 38 Zheda Road, Xihu\n Hangzhou, Zhejiang, 3100... |


8. __To obtain the dimensions of the data__

In [12]:
data.shape
# (rows,columns)

(300, 12)

9. __To know the data types of the data frame__

In [13]:
data.dtypes
# Shows Data type of each variable

name               object
world_ranking      object
region_ranking     object
country_ranking    object
region             object
country            object
city\state         object
acceptance_rate    object
publication        object
website            object
phone_no           object
address            object
dtype: object
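
Every column above is reported as `object`, including ones that look numeric such as `acceptance_rate` and `publication`. A possible way to convert them is sketched below on a small stand-in frame. The rows are hypothetical, and the sketch assumes that percentages are stored as text like `15%` and that publication counts may contain thousands separators, as in the `36,231` value shown later for row 2.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like two columns of the sample data.
sample = pd.DataFrame({
    "acceptance_rate": ["15%", np.nan, "57%"],
    "publication": ["89,633", "99,086", "36,231"],
})

# Strip the percent sign, then parse as numbers (NaN stays NaN).
sample["acceptance_rate"] = pd.to_numeric(
    sample["acceptance_rate"].str.rstrip("%")
)

# Remove thousands separators, then parse as integers.
sample["publication"] = pd.to_numeric(
    sample["publication"].str.replace(",", "", regex=False)
)

print(sample.dtypes)
```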

10. __To get summary information about the data__

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             300 non-null    object
 1   world_ranking    199 non-null    object
 2   region_ranking   199 non-null    object
 3   country_ranking  199 non-null    object
 4   region           300 non-null    object
 5   country          300 non-null    object
 6   city\state       300 non-null    object
 7   acceptance_rate  224 non-null    object
 8   publication      300 non-null    object
 9   website          294 non-null    object
 10  phone_no         296 non-null    object
 11  address          300 non-null    object
dtypes: object(12)
memory usage: 28.2+ KB


- This output gives the number of rows present in the data: `RangeIndex: 300 entries, 0 to 299` means there are 300 rows numbered from 0 to 299. There are a total of 12 columns: `Data columns (total 12 columns):`

The line `0 name 300 non-null object` indicates that the column named 'name' has 300 non-null observations with a datatype of `object`. Finally, the memory used to store this data frame is `memory usage: 28.2+ KB`.

11. __To check the data type of column in the data frame__

In [15]:
type(data.name)

pandas.core.series.Series

In [16]:
type(data["publication"])

pandas.core.series.Series

- Note that every column of the data frame is a pandas Series.

### 4.2 Manipulating DataFrames

1. __Adding new column to data set__

In [None]:
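# Assumes `data` has numeric `weight` and `height` columns (the university sample above has neither)
data["BMI"] = data["weight"] / data["height"]**2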
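
The cell above refers to `weight` and `height` columns, which the university sample loaded earlier does not contain. A self-contained sketch with a small hypothetical table (names and values invented for illustration, weight assumed in kilograms and height in metres) shows the same column arithmetic:

```python
import pandas as pd

# Hypothetical measurements: weight in kg, height in m.
people = pd.DataFrame({
    "weight": [70.0, 85.0],
    "height": [1.75, 1.80],
})

# Arithmetic on columns is applied row by row;
# assigning to a new label adds the column.
people["BMI"] = people["weight"] / people["height"] ** 2
print(people.round(2))
```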

2. __Add a new row in a data set__
- A new row can be added by assigning to a new index label with `.loc[]`. Making a copy first with `.copy()` keeps the original data frame unchanged.

In [None]:
# Create a copy so that the original data frame is not modified
data_copy = data.copy()
# Add a new row at index label 301. The Series is aligned on column names,
# so columns that are not supplied are filled with NaN.
data_copy.loc[301] = pd.Series({"name": "Zhejiang University"})

- We see that a new row with index label 301 has been added to the data.
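
Another common way to append rows is `pd.concat`, which returns a new data frame and leaves its inputs untouched. This is a sketch on a small hypothetical frame rather than the university data:

```python
import pandas as pd

df = pd.DataFrame({"Subject": ["React", "Rust"], "Marks": [89, 34]})

# Build the new row as a one-row DataFrame with matching column names.
new_row = pd.DataFrame([{"Subject": "Golang", "Marks": 65}])

# ignore_index=True renumbers the rows 0..n-1 in the result.
extended = pd.concat([df, new_row], ignore_index=True)
print(extended)
```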

3. __Indexing a dataframe using `.iloc[]`__
- `DataFrame.iloc[]` selects rows and columns by integer position (0, 1, 2, ...), regardless of the index labels. It is useful when the index labels of the data frame are something other than the numeric series 0, 1, 2, 3 ... n, or in case the user doesn't know the index labels.
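
Position-based selection is easiest to see on a frame whose index labels are not numbers. The sketch below uses a small hypothetical frame indexed by subject name:

```python
import pandas as pd

marks = pd.DataFrame(
    {"Marks": [89, 34, 65]},
    index=["React", "Rust", "Golang"],
)

# Position 0 is the first row, whatever its label is.
first_row = marks.iloc[0]
print(first_row)

# Negative positions count from the end: last row, first column.
last_value = marks.iloc[-1, 0]
print(last_value)

# Slices exclude the end position: this selects positions 0 and 1.
first_two = marks.iloc[0:2]
print(first_two)
```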

We shall work on the university data set loaded above.

__Select the third row (position 2)__

In [8]:
data.iloc[2]

name                                               Auburn University
world_ranking                                                    NaN
region_ranking                                                   NaN
country_ranking                                                  NaN
region                                                 North America
country                                                United States
city\state                                                 Wisconsin
acceptance_rate                                                  57%
publication                                                   36,231
website                                                   auburn.edu
phone_no                                                  3348444000
address            Samford Hall\n Auburn, Alabama, 36849 \nUnited...
Name: 2, dtype: object

__Select the rows at positions 4 and 7__

In [11]:
# Creates a sub data frame from the list passed to it 
data.iloc[[4,7]]

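|   | name | world_ranking | region_ranking | country_ranking | region | country | city\state | acceptance_rate | publication | website | phone_no | address |
|---|------|---------------|----------------|-----------------|--------|---------|------------|-----------------|-------------|---------|----------|---------|
| 4 | Autonomous University of Barcelona | NaN | NaN | NaN | Europe | Italy | Emilia-Romagna | 11% | 74922 | www.uab.cat | +34 935812222 | Campus de Bellaterra, Edificio A\n Cerdanyola ... |
| 7 | Boston University | #49 of 14,131 | #42 of 2,597 | #39 of 2,496 | Europe | Greece | Athens | NaN | 107676 | bu.edu | 6173532000 | One Silber Way\n Boston, Massachusetts, 02215 ... |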


- We use an inner pair of square brackets since we are passing a list of row positions to be accessed.

__Select the rows at positions 2 to 4 (the end of the slice, 5, is excluded)__

In [12]:
data.iloc[2:5]

|   | name | world_ranking | region_ranking | country_ranking | region | country | city\state | acceptance_rate | publication | website | phone_no | address |
|---|------|---------------|----------------|-----------------|--------|---------|------------|-----------------|-------------|---------|----------|---------|
| 2 | Auburn University | NaN | NaN | NaN | North America | United States | Wisconsin | 57% | 36231 | auburn.edu | 3348444000 | Samford Hall\n Auburn, Alabama, 36849 \nUnited... |
| 3 | Australian National University | #88 of 14,131 | #5 of 59 | #5 of 40 | North America | Canada | Ontario | 86% | 97754 | www.anu.edu.au | +61 (0)2 6125 5111 | Ellery Crescent, Acton\n Canberra, Australian ... |
| 4 | Autonomous University of Barcelona | NaN | NaN | NaN | Europe | Italy | Emilia-Romagna | 11% | 74922 | www.uab.cat | +34 935812222 | Campus de Bellaterra, Edificio A\n Cerdanyola ... |


__Select the first column__

In [15]:
# going after the first column
data.iloc[:,0]

0                       Aarhus University
1        Arizona State University - Tempe
2                       Auburn University
3          Australian National University
4      Autonomous University of Barcelona
                      ...                
295             Xi'an Jiaotong University
296                       Yale University
297                     Yonsei University
298                       York University
299                   Zhejiang University
Name: name, Length: 300, dtype: object

In [16]:
# Select last column
data.iloc[:,-1]

0      Nordre Ringgade 1\n Aarhus, Central Denmark Re...
1      University Drive and Mill Avenue\n Tempe, Ariz...
2      Samford Hall\n Auburn, Alabama, 36849 \nUnited...
3      Ellery Crescent, Acton\n Canberra, Australian ...
4      Campus de Bellaterra, Edificio A\n Cerdanyola ...
                             ...                        
295    28 Xianning Road\n Xi'an, Shaanxi, 710049 \nChina
296    Woodbridge Hall\n New Haven, Connecticut, 0652...
297    134 Sinchon-dong, Seodaemun-gu\n Seoul, Seoul,...
298    4700 Keele Street\n Toronto, Ontario, M3J 1P3 ...
299    38 Zheda Road, Xihu\n Hangzhou, Zhejiang, 3100...
Name: address, Length: 300, dtype: object

In [17]:
# Select the first column as a DataFrame (the slice 0:1 excludes position 1)
data.iloc[:,0:1]

|   | name |
|---|------|
| 0 | Aarhus University |
| 1 | Arizona State University - Tempe |
| 2 | Auburn University |
| 3 | Australian National University |
| 4 | Autonomous University of Barcelona |
| ... | ... |
| 295 | Xi'an Jiaotong University |
| 296 | Yale University |
| 297 | Yonsei University |
| 298 | York University |


__Selecting the first 2 columns of the rows at positions 5 and 6__

In [18]:
data.iloc[5:7,0:2]

|   | name | world_ranking |
|---|------|---------------|
| 5 | Baylor College of Medicine | NaN |
| 6 | Boston College | NaN |


4. __Indexing a dataframe using `.loc`__