
# PANDAS: Views and Selection

## Viewing data

### Head and Tail Methods

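The head() and tail() methods return the first or last rows of a DataFrame, which makes them a quick way to preview data.

Some key points about head() and tail():

* Both take an optional row count n; the default is 5 rows if n is not provided.
* Helpful for quickly checking or previewing data.
* Cheap to call, because only the requested rows are returned. The full DataFrame is already in memory, so they do not reduce the cost of loading it.
* They also work on other pandas objects, such as Series.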
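
As a minimal sketch of the points above, the following uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not part of the course data) to show the default row count, an explicit count, and the same call on a Series.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate head() and tail()
toy = pd.DataFrame({"value": range(10)})

first_default = toy.head()           # no argument: first 5 rows
first_three = toy.head(3)            # first 3 rows
last_two = toy.tail(2)               # last 2 rows
series_head = toy["value"].head(2)   # the same methods exist on a Series

print(len(first_default), len(first_three), len(last_two))
print(list(last_two["value"]))
```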

In [89]:
import pandas as pd
import numpy as np

In [91]:
data = {'Name': ['Ronald', 'Brock', 'Liz', 'Hector', 'Lionel', 'Gabriel', 'Aaron', 'Ben'],
        'Age': [25, 27, 31, 19, 34, 22, 20, 29]}

df = pd.DataFrame(data)

In [93]:
df.head(6)

Unnamed: 0,Name,Age
0,Ronald,25
1,Brock,27
2,Liz,31
3,Hector,19
4,Lionel,34
5,Gabriel,22


In [100]:
df.sort_values(by=['Age']).tail()

Unnamed: 0,Name,Age
0,Ronald,25
1,Brock,27
7,Ben,29
2,Liz,31
4,Lionel,34


In [None]:
df_world = pd.read_csv('world-data-2023.csv')

In [None]:
df_world.head(10)

In [104]:
# Try different parameters for the given methods...
df_world.sort_values(by=['Population'], ascending=True).tail()

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
191,Vietnam,314,VN,39.30%,331210.0,522000.0,16.75,84.0,Hanoi,192668.0,...,43.50%,0.82,96462106.0,77.40%,19.10%,37.60%,2.01%,35332140.0,14.058324,108.277199
5,Antigua and Barbuda,223,AG,20.50%,443.0,0.0,15.33,1.0,"St. John's, Saint John",557.0,...,24.30%,2.76,97118.0,,16.50%,43.00%,,23800.0,17.060816,-61.796428
154,Seychelles,214,SC,3.40%,455.0,0.0,17.1,248.0,"Victoria, Seychelles",605.0,...,2.50%,0.95,97625.0,,34.10%,30.10%,,55762.0,-4.679574,55.491977
47,Djibouti,43,DJ,73.40%,23200.0,13000.0,21.47,253.0,Djibouti City,620.0,...,20.40%,0.22,973560.0,60.20%,,37.90%,10.30%,758549.0,11.825138,42.590275
133,Palestinian National Authority,847,,,,,,,,,...,,,,,,,,,31.952162,35.233154
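
The rows above are not in numeric order: 97118 comes after 96462106. That is the order a character-by-character text sort produces. One possible explanation, which is an assumption here because the column's dtype is not shown, is that Population was read as text with thousands separators. The sketch below reproduces the effect with hypothetical string values and shows one way to convert them before sorting; `pop` and `Population_num` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for a column that was read as text with thousands separators
pop = pd.DataFrame({
    "Country": ["Vietnam", "Antigua and Barbuda", "Seychelles", "Djibouti"],
    "Population": ["96,462,106", "97,118", "97,625", "973,560"],
})

# Sorting strings compares them character by character, not by magnitude
text_sorted = pop.sort_values(by="Population")

# Strip the separators, convert to numbers, then sort
pop["Population_num"] = pd.to_numeric(
    pop["Population"].str.replace(",", "", regex=False)
)
num_sorted = pop.sort_values(by="Population_num")

print(text_sorted["Country"].tolist())
print(num_sorted["Country"].tolist())
```

Checking `df_world['Population'].dtype` is a quick way to see whether this applies to the real data.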


### Describe Method

The describe() method generates descriptive statistics for a DataFrame or Series. It provides different output for numeric and object/categorical columns, and on a DataFrame with mixed types it summarizes only the numeric columns by default.

For quantitative data, describe() includes:

* count - The number of non-NaN values in each column.
* mean - The average value for each column.
* std - The standard deviation, measuring how dispersed the values are from the mean.
* min - The minimum value in each column.
* 25% - The 25th percentile, where 25% of values are lower.
* 50% - The median or 50th percentile. The midpoint of the values.
* 75% - The 75th percentile, where 75% of the values are lower.
* max - The maximum value in each column.

Syntax: `DataFrame.describe(percentiles=None, include=None, exclude=None)`
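
To make the three parameters in the syntax line concrete, here is a small sketch on made-up data (the `people` frame and its values are assumptions for illustration). It shows the default call, custom percentiles, and two ways of bringing non-numeric columns into the summary.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one text column
people = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "city": ["Lima", "Lima", "Quito", "Bogota"],
})

numeric_only = people.describe()                      # default: numeric columns only
custom_pct = people.describe(percentiles=[0.1, 0.9])  # choose which percentiles to report
text_only = people.describe(exclude=[np.number])      # drop numeric columns, keep the rest
everything = people.describe(include="all")           # numeric and non-numeric together

print(custom_pct.index.tolist())
print(text_only)
```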

In [106]:
df_sample = pd.read_csv('sample.csv')
df_sample.describe()

Unnamed: 0,id,age,weight,gender,height
count,69.0,69.0,69.0,69.0,69.0
mean,34.0,59.550725,188.072464,1.434783,200.695652
std,20.062403,34.956501,99.470261,0.49936,53.319046
min,0.0,3.0,53.0,1.0,110.0
25%,17.0,37.0,95.0,1.0,158.0
50%,34.0,66.0,178.0,1.0,200.0
75%,51.0,92.0,288.0,2.0,247.0
max,68.0,116.0,350.0,2.0,294.0


In [108]:
df_world['Official language']

0          Pashto
1        Albanian
2          Arabic
3         Catalan
4      Portuguese
          ...    
190       Spanish
191    Vietnamese
192        Arabic
193       English
194         Shona
Name: Official language, Length: 195, dtype: object

In [111]:
df_world['Official language'].describe()

count         194
unique         77
top       English
freq           31
Name: Official language, dtype: object
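
In the output above, count (194) is one less than the Series length (195) because count ignores missing values. The sketch below uses a short hypothetical Series to show how count, unique, top, and freq relate to the raw values, with value_counts() as a cross-check.

```python
import pandas as pd

# Hypothetical values; None marks a missing entry
langs = pd.Series(["English", "Spanish", "English", None, "Arabic"], name="language")

summary = langs.describe()
counts = langs.value_counts()   # frequencies, most common first; missing values are skipped

print(summary["count"], summary["unique"], summary["top"], summary["freq"])
print(len(langs), langs.isna().sum())
```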

## Selection

#### Bracket Notation []

Select columns:

* Single brackets with one column name select that column as a Series
* Double brackets pass a list of column names and return a DataFrame

In [None]:
df['Name']           # Single column as Series
df[['Name', 'Age']]  # Multiple columns as DataFrame

In [117]:
df_world[['Country','Abbreviation', 'Birth Rate']]

Unnamed: 0,Country,Abbreviation,Birth Rate
0,Afghanistan,AF,32.49
1,Albania,AL,11.78
2,Algeria,DZ,24.28
3,Andorra,AD,7.20
4,Angola,AO,40.73
...,...,...,...
190,Venezuela,VE,17.88
191,Vietnam,VN,16.75
192,Yemen,YE,30.45
193,Zambia,ZM,36.19
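
A short sketch of the difference between single and double brackets, using a hypothetical two-row frame (`df_demo` is an illustrative name): the first returns a Series, the second a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical two-column frame
df_demo = pd.DataFrame({"Name": ["Liz", "Ben"], "Age": [31, 29]})

one_col = df_demo["Name"]            # Series
one_col_frame = df_demo[["Name"]]    # one-column DataFrame (note the list)
two_cols = df_demo[["Age", "Name"]]  # columns come back in the order listed

print(type(one_col).__name__, type(one_col_frame).__name__)
print(list(two_cols.columns))
```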


#### loc Attribute

Select rows and columns by label:

* A single label selects one row
* A slice with ':' selects a range of rows; unlike ordinary Python slices, both endpoints are included
* ':' alone selects all rows

In [None]:
df.loc[1]          # Single row by label
df.loc[1:5]        # Rows labeled 1 through 5 (end label included)
df.loc[:, 'Name']  # All rows, single column

In [129]:
df_world.loc[23:34:2, ['Country','Abbreviation']]

Unnamed: 0,Country,Abbreviation
23,Brazil,BR
25,Bulgaria,BG
27,Burundi,BI
29,Cape Verde,CV
31,Cameroon,CM
33,Central African Republic,CF
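
With the default integer index, labels and positions look the same, which can hide what loc actually does. The sketch below is an illustration with a hypothetical frame indexed by country code (`codes` is an assumed name), so the labels are visibly not positions.

```python
import pandas as pd

# Hypothetical frame indexed by country code instead of the default 0..n-1
codes = pd.DataFrame(
    {"Country": ["Brazil", "Bulgaria", "Burundi", "Chad"]},
    index=["BR", "BG", "BI", "TD"],
)

single = codes.loc["BG", "Country"]            # one cell, by row and column label
inclusive = codes.loc["BG":"TD"]               # label slice: both endpoints are included
picked = codes.loc[["TD", "BR"], ["Country"]]  # lists of labels, in the order given

print(single)
print(list(inclusive.index))
print(list(picked.index))
```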


#### iloc Attribute

Select rows and columns by integer position:

* A single integer selects one row/column
* A slice with ':' selects a range; as with ordinary Python slices, the end position is excluded
* ':' alone selects all rows/columns

In [None]:
df.iloc[1]         # Row by integer position
df.iloc[1:5, 0:2]  # Slice of rows and columns (end positions excluded)
df.iloc[:, 1]      # All rows, single column (df has only two columns: positions 0 and 1)

In [136]:
df_world.iloc[4:10, [2,3,7]]

Unnamed: 0,Abbreviation,Agricultural Land( %),Calling Code
4,AO,47.50%,244.0
5,AG,20.50%,1.0
6,AR,54.30%,54.0
7,AM,58.90%,374.0
8,AU,48.20%,61.0
9,AT,32.40%,43.0
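
The difference between position and label becomes visible once rows are reordered. A minimal sketch, using hypothetical data (`ages` and `by_age` are illustrative names): after sorting, iloc[0] is the first row in the new order, while loc[0] is still the row whose label is 0.

```python
import pandas as pd

# Hypothetical data; sorting reorders rows but keeps their original index labels
ages = pd.DataFrame({"Name": ["Ronald", "Hector", "Lionel"], "Age": [25, 19, 34]})
by_age = ages.sort_values(by="Age")

youngest = by_age.iloc[0]["Name"]    # position 0: first row after sorting
label_zero = by_age.loc[0]["Name"]   # label 0: the row that was originally first
first_two = by_age.iloc[0:2]         # positions 0 and 1; the end position is excluded

print(youngest, label_zero)
print(list(first_two["Name"]))
```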


#### Boolean Masking

Select the rows where a boolean condition is True:

In [None]:
df[df['Age'] > 25]  # Rows where Age is greater than 25

In [139]:
# Only the last expression in a cell is displayed; wrap describe() in print() to see it too
df_world['Fertility Rate'].describe()
df_world[df_world['Fertility Rate'] > 4]

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
0,Afghanistan,60,AF,58.10%,652230,323000.0,32.49,93.0,Kabul,8672,...,78.40%,0.28,38041754,48.90%,9.30%,71.40%,11.12%,9797273,33.93911,67.709953
4,Angola,26,AO,47.50%,1246700,117000.0,40.73,244.0,Luanda,34693,...,33.40%,0.21,31825295,77.50%,9.20%,49.10%,6.89%,21061025,-11.202692,17.873887
18,Benin,108,BJ,33.30%,112622,12000.0,36.22,229.0,Porto-Novo,6476,...,40.50%,0.08,11801151,70.90%,10.80%,48.90%,2.23%,5648149,9.30769,2.315834
26,Burkina Faso,76,BF,44.20%,274200,11000.0,37.93,226.0,Ouagadougou,3418,...,36.10%,0.08,20321378,66.40%,15.00%,41.30%,6.26%,6092349,12.238333,-1.561593
27,Burundi,463,BI,79.20%,27830,31000.0,39.01,257.0,Bujumbura,495,...,19.10%,0.1,11530580,79.20%,13.60%,41.20%,1.43%,1541177,-3.373056,29.918886
28,Ivory Coast,83,CI,64.80%,322463,27000.0,35.74,225.0,Yamoussoukro,9674,...,36.00%,0.23,25716544,57.00%,11.80%,50.10%,3.32%,13176900,7.539989,-5.54708
31,Cameroon,56,CM,20.60%,475440,24000.0,35.39,237.0,Yaoundé,8291,...,69.70%,0.09,25876380,76.10%,12.80%,57.70%,3.38%,14741256,7.369722,12.354722
33,Central African Republic,8,CF,8.20%,622984,8000.0,35.35,236.0,Bangui,297,...,39.60%,0.06,4745185,72.00%,8.60%,73.30%,3.68%,1982064,6.611111,20.939444
34,Chad,13,TD,39.70%,1284000,35000.0,42.17,235.0,N'Djamena,1016,...,56.40%,0.04,15946876,70.70%,,63.50%,1.89%,3712273,15.454166,18.732207
38,Comoros,467,KM,71.50%,2235,,31.88,269.0,"Moroni, Comoros",202,...,74.80%,0.27,850886,43.30%,,219.60%,4.34%,248152,-11.6455,43.3333
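
Masks can also be combined. The sketch below uses hypothetical values (the `toy` frame is an assumption for illustration) to show the element-wise operators `&`, `|`, and `~`, plus isin() for membership tests. Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical values for illustration
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Fertility Rate": [4.5, 2.1, 5.0, 1.8],
    "Birth Rate": [32.0, 12.0, 41.0, 10.0],
})

high_both = toy[(toy["Fertility Rate"] > 4) & (toy["Birth Rate"] > 40)]  # AND
either = toy[(toy["Fertility Rate"] > 4) | (toy["Birth Rate"] < 11)]     # OR
not_high = toy[~(toy["Fertility Rate"] > 4)]                             # NOT
in_set = toy[toy["Country"].isin(["A", "D"])]                            # membership

print(list(high_both["Country"]), list(either["Country"]))
print(list(not_high["Country"]), list(in_set["Country"]))
```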


In [141]:
df_world[df_world['Fertility Rate'] > 4].iloc[1:5,0:2]

Unnamed: 0,Country,Density\n(P/Km2)
4,Angola,26
18,Benin,108
26,Burkina Faso,76
27,Burundi,463
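
In the chained call above, iloc counts positions within the filtered result, which is why the output keeps the original labels (4, 18, 26, 27). As an alternative sketch on hypothetical data (`toy` and `mask` are illustrative names), loc accepts a boolean mask together with column labels, so rows and columns can be selected in one step.

```python
import pandas as pd

# Hypothetical values for illustration
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Density": [60, 26, 108, 76, 463],
    "Fertility Rate": [4.5, 2.1, 5.0, 1.8, 5.4],
})

mask = toy["Fertility Rate"] > 4

# Chained: filter first, then slice the filtered rows by position
chained = toy[mask].iloc[1:3, 0:2]

# One step: boolean mask for rows, labels for columns
one_step = toy.loc[mask, ["Country", "Density"]]

print(list(chained.index), list(one_step.index))
```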
