# Unit 2 - Reading, describing, and selecting data
---

1. [Reading files](#section1)
2. [Describe the data](#section2)
3. [Selecting columns](#section3)
4. [Conditional selection](#section4)
5. [loc and iloc](#section5)
   


<a id='section1'></a>

## 1. Reading files
---

<div>
<img src="https://github.com/nlihin/data-analytics/blob/main/images/reading%20pandas.png?raw=true" width="400"/>
<figcaption align="center">
    <small> Image generated by gencraft using ChatGPT text </small>
</figcaption>
</div>

We will read the file using pandas.\
Pandas can read CSV files, as well as many other file types.\
[More details](https://pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table)

In [None]:
import pandas as pd

We will read data on [reported aircraft wildlife strikes in the U.S.](https://wildlife.faa.gov/search)\
I downloaded the file from the FAA and uploaded it to my Github. 

In [None]:
url = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/aircraft%20wildlife%20strikes.csv'
strike_df = pd.read_csv(url)

View the first few rows:

In [None]:
strike_df.head(7)

`read_csv` accepts dozens of optional parameters. See the 
[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

For example, `sep='\t'` is used for tab-delimited files and `usecols` reads only specific columns. 
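As a minimal sketch of these two options, the cell below parses a small tab-delimited table held in memory instead of a real file. The table contents and the names `tsv_text` and `small_df` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-delimited data, standing in for a real file
tsv_text = "Airport\tIncident Year\tHeight\nJFK\t2014\t500\nLAX\t2015\t1200\n"

# sep='\t' tells read_csv that columns are separated by tabs;
# usecols keeps only the listed columns
small_df = pd.read_csv(io.StringIO(tsv_text), sep='\t', usecols=['Airport', 'Height'])
print(small_df)
```

The `Incident Year` column is skipped while reading, which can save memory on wide files.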

### Basic information

In [None]:
type(strike_df)

Dataset length:

In [None]:
len(strike_df)

View the shape of the dataframe:

In [None]:
strike_df.shape

View basic information:

In [None]:
strike_df.info()

In [None]:
strike_df.columns

We can view unique values:

In [None]:
strike_df['Incident Year'].unique()

### <span style="color:blue"> Exercise:</span>
> What do you think that the `tail()` command does? Try it out!
>
> What happens if we just type `strike_df`, without a head or a tail?
>
> Find what kinds of `Flight Impact` are in the dataset

---
A summary of the functions so far:

>* `pd.read_csv` - Read data from a CSV file into a Pandas `DataFrame` object
>* `type` - The data type 
>* `.info()` - View basic information about rows, columns & data types
>* `.columns` - Get the list of column names
>* `.shape` - Get the number of rows & columns as a tuple
>* `.head()`, `.tail()` - View the beginning/end of the file
>* `.unique()` - view the unique entries in a series (not a dataframe) object
>* `len` - dataframe length (number of rows)



<a id='section2'></a>
## 2. Describe the data

### Describe **numeric** data

In [None]:
strike_df.describe()

A note on e: e stands for "times 10 to the power of", and it is always followed by another number, which is the value of the exponent:

5e+1 = 5 × 10^1 = 50

6e+2 = 6 × 10^2 = 600
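Python itself understands this notation, so it can be checked directly. The short sketch below uses made-up numbers and does not depend on the dataset.

```python
# Python reads the same e-notation that describe() prints
fifty = float('5e+1')      # 5 x 10^1
six_hundred = 6e2          # 6 x 10^2
small = 2.5e-3             # a negative exponent gives a small number: 0.0025

print(fifty, six_hundred, small)

# Format an ordinary number in e-notation with two decimals
formatted = f"{12345.678:.2e}"
print(formatted)
```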



### Describe **categorical** data:

In [None]:
strike_df.describe(include = 'object')

In [None]:
strike_df[['Species Name']].describe()

Count how many times each species is mentioned in the dataset

In [None]:
strike_df['Species Name'].value_counts()

The returned object is a Series.  
Use `index` and `values` to access the index and values of the returned Series.
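Here is a small sketch of `index` and `values` on the result of `value_counts()`. The `species` Series is a made-up stand-in for a real column.

```python
import pandas as pd

# Hypothetical species column
species = pd.Series(['GULL', 'DOVE', 'GULL', 'HAWK', 'GULL', 'DOVE'])

counts = species.value_counts()   # sorted from most to least frequent
print(counts.index)               # the distinct values
print(counts.values)              # how many times each one appears

most_common = counts.index[0]     # label of the most frequent value
n_unique = len(counts)            # number of distinct values
print(most_common, n_unique)
```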

---
A summary of the functions in this unit:

>* `describe()` - View statistical information for numerical data
>* `describe(include = 'object')` - for categorical data
>* `value_counts()` - counts how many times a certain value appears in a column
>* `index`, `values` - access the index and values of a series




### <span style="color:blue"> Exercise:</span>
> Now you:
>
> Which `Operator` has the most incidents?  ** Can you output ONLY the operator name?
>
> How many operators are there? **output ONLY a number

---

<a id='section3'></a>
## 3. Selecting columns

#### Selecting a single column:

**return a single column as a series:**

In [None]:
strike_df.Airport

Note: the `.` notation works only for columns whose names are valid Python identifiers (no spaces or special characters) and do not clash with an existing dataframe attribute or method.\
In all other cases, use []
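The difference between the two notations can be sketched on a tiny dataframe. The frame `toy_df` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical two-column frame; 'Flight Impact' contains a space
toy_df = pd.DataFrame({
    'Airport': ['JFK', 'LAX'],
    'Flight Impact': ['NONE', 'ABORTED TAKE-OFF'],
})

by_dot = toy_df.Airport             # works: 'Airport' is a valid Python identifier
by_bracket = toy_df['Airport']      # brackets always work
impact = toy_df['Flight Impact']    # brackets are required because of the space

print(by_dot.equals(by_bracket))
```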

In [None]:
type(strike_df["Flight Impact"])

**return a single column as a dataframe:**

In [None]:
type(strike_df[["Flight Impact"]])

#### Selecting a few columns 

In [None]:
strike_df[['Incident Year','Airport']].head()

#### Select a specific cell: 


In [None]:
strike_df.Airport[70000]

---
### <span style="color:blue"> Exercise:</span>
> Now you:
>
> Select one column from the `strike_df` dataframe
>
> What is the difference between `strike_df[['Airport']]` and `strike_df['Airport']` and `strike_df.Airport` ?  Use `type` to find out


---

<a id='section4'></a>
## 4. Conditional selection

Our condition: only data on incidents involving United Airlines

In [None]:
strike_df['Operator'] == 'UNITED AIRLINES'

This creates a series of True/False values.

We can pass this series to the dataframe to select only the matching rows:

In [None]:
strike_df[strike_df['Operator'] == 'UNITED AIRLINES']

### Select on more than one condition

In [None]:
strike_df[(strike_df.Operator == 'UNITED AIRLINES')| (strike_df.Operator == 'DELTA AIR LINES') | (strike_df.Operator == 'AMERICAN AIRLINES')]

**A more efficient way to do this:**

In [None]:
strike_df[strike_df['Operator'].isin(['UNITED AIRLINES', 'DELTA AIR LINES','AMERICAN AIRLINES'])]

In [None]:
strike_df[(strike_df.Operator == 'UNITED AIRLINES') & (strike_df['Incident Year'] == 2015)]

# You can also use Query:
#strike_df.query("Operator == 'UNITED AIRLINES' and `Incident Year` == 2015")
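To check that these forms select the same rows, here is a sketch on a small made-up dataframe (`toy_df` is hypothetical, not the strikes data).

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Operator': ['UNITED AIRLINES', 'DELTA AIR LINES', 'UNITED AIRLINES', 'SOUTHWEST AIRLINES'],
    'Incident Year': [2015, 2015, 2014, 2015],
})

# | means "or"; each condition needs its own parentheses
with_or = toy_df[(toy_df['Operator'] == 'UNITED AIRLINES') | (toy_df['Operator'] == 'DELTA AIR LINES')]
# isin() expresses the same "or" as a list of allowed values
with_isin = toy_df[toy_df['Operator'].isin(['UNITED AIRLINES', 'DELTA AIR LINES'])]

# & means "and"
with_and = toy_df[(toy_df['Operator'] == 'UNITED AIRLINES') & (toy_df['Incident Year'] == 2015)]
# query() takes the condition as a string; backticks wrap column names that contain spaces
with_query = toy_df.query("Operator == 'UNITED AIRLINES' and `Incident Year` == 2015")

print(with_or.equals(with_isin), with_and.equals(with_query))
```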

### <span style="color:blue"> Exercise:</span>
> Select all the incidents of `Species Name`:MOURNING DOVE in 2014
>
> How many incidents of this type are there?

### Selection using a variable

Find the incidents that occurred at the maximum height 

In [None]:
max_height = strike_df['Height'].max()
max_height

In [None]:
strike_df[(strike_df['Height'] == max_height)]

Find the incidents that occurred at the most frequent speed

In [None]:
most_frequent_speed = strike_df['Speed'].mode()
most_frequent_speed

Note: `mode()` always returns a Series, because several values can tie for the highest frequency; in that case the Series contains all of them.  
You can extract the first mode using `.values[0]`

In [None]:
most_frequent_speed = strike_df['Speed'].mode().values[0]
most_frequent_speed

In [None]:
strike_df[(strike_df['Speed'] == most_frequent_speed)]
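To see what a tie looks like, here is a sketch with invented speeds in which two values are equally frequent. The name `speeds` is hypothetical.

```python
import pandas as pd

# Hypothetical speeds: 120 and 140 each appear twice
speeds = pd.Series([120, 140, 140, 120, 160])

modes = speeds.mode()          # a Series holding every tied value, in ascending order
first_mode = modes.iloc[0]     # same idea as .values[0]

# Keep the rows that match ANY of the modes, not just the first
rows_at_any_mode = speeds[speeds.isin(modes)]

print(modes.tolist(), first_mode, len(rows_at_any_mode))
```

Taking only the first mode silently drops the other tied values, so `isin` may be the safer choice when ties matter.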

### <span style="color:blue"> Exercise:</span>
> Select ONLY the Operator and Aircraft of the incident at the maximum height
>
> How many incidents occurred for `Operator`: UNITED AIRLINES during `Visibility`: DAY?
>
> ** Select all incidents that did NOT occur in 2010
>
> ** What is the mean (average) speed at which incidents occurred for `Operator`: UNITED AIRLINES during `Visibility`: DAY?
>

<a id='section5'></a>
## 5. `loc` and `iloc` 

One way to select rows and columns by position is `iloc`.

`.iloc` - selects subsets of rows and columns by integer location only

In [None]:
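#strike_df.iloc[[70000]]  # the row at position 70000, as a dataframe
strike_df.iloc[[5,50,50000]]  # the rows at positions 5, 50 and 50000, as a dataframe
strike_df.iloc[[-1]] # last row as a dataframe (use iloc[-1] for a series)
strike_df.iloc[111:114,[0,5]] # rows at positions 111-113, 1st and 6th columns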

The `:` operator:

- when used alone it means "everything"

- it is also used to indicate a ***slice*** of values


In [None]:
strike_df.iloc[2:4] # third and fourth rows (positions 2 and 3)
#strike_df.iloc[[-1,2,22]] #a few specific rows

# Columns:
strike_df.iloc[:,0] # first column of data frame  
strike_df.iloc[:,[1,2]] # second and third columns of data frame  
#strike_df.iloc[:,-1] # last column of data frame

#Rows and columns
#strike_df.iloc[0:5] # first five rows of dataframe
#strike_df.iloc[4:6, 0:2] # fifth and sixth rows, first two columns
#strike_df.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
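The `iloc` patterns above are easier to verify on a frame small enough to read at a glance. The sketch below uses an invented four-row dataframe, `toy_df`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [1, 2, 3, 4],
    'c': ['w', 'x', 'y', 'z'],
})

row_as_series = toy_df.iloc[0]         # single integer -> Series
row_as_frame = toy_df.iloc[[0]]        # list of integers -> DataFrame
middle_rows = toy_df.iloc[1:3]         # positions 1 and 2 (the stop is excluded)
block = toy_df.iloc[[0, 3], [0, 2]]    # 1st and 4th rows, 1st and 3rd columns
empty = toy_df.iloc[3:1]               # start after stop -> no rows

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(middle_rows)
```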

**What if I want to select the `Airport` column, but I don't remember the column number?**

Use `.loc`

`.loc` - selects subsets of rows and columns by label, or with a boolean mask. Allowed inputs include:

- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

- A list or array of labels, e.g. ['a', 'b', 'c'].

- A slice object with labels, e.g. 'a':'f'.

- A boolean array or Series with one True/False per row, such as the conditions from section 4.
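The label-versus-position distinction only becomes visible when the index labels are not simply 0, 1, 2, ... The sketch below assumes a made-up frame with a shuffled integer index.

```python
import pandas as pd

# Hypothetical frame whose index labels are 5, 0, 9 (in that order)
toy_df = pd.DataFrame({'Airport': ['JFK', 'LAX', 'ORD']}, index=[5, 0, 9])

by_label = toy_df.loc[0, 'Airport']    # the row whose LABEL is 0
by_position = toy_df.iloc[0, 0]        # the row at POSITION 0

print(by_label, by_position)
```

With the default index, as in `strike_df`, labels and positions happen to coincide, which is why the two can look interchangeable.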

In [None]:
strike_df.loc[2:4,['Airport','Incident Year']]

Find all the airports that had incidents at the minimum height, and the year of occurrence

In [None]:
min_height = strike_df['Height'].min()

strike_df.loc[strike_df['Height'] == min_height, ['Airport', 'Incident Year','Height']]


#alternative version using query. It can be faster on BIG data
#result = strike_df.query("Height == @min_height")[['Airport', 'Incident Year']]
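As a sketch of how the `@` prefix works, the cell below compares the `.loc` form with the `query` form on invented data. The frame `toy_df` is hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Airport': ['JFK', 'LAX', 'ORD'],
    'Height': [0, 500, 0],
})
min_height = toy_df['Height'].min()

# Boolean mask for the rows, list of labels for the columns
with_loc = toy_df.loc[toy_df['Height'] == min_height, ['Airport', 'Height']]

# Inside the query string, @ refers to a Python variable rather than a column
with_query = toy_df.query("Height == @min_height")[['Airport', 'Height']]

print(with_loc.equals(with_query))
```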

The semantics are similar to `iloc`, but note:

- `iloc` excludes the last element. `df.iloc[0:1000]` returns the rows at positions 0...999
- `loc` includes the last element. `df.loc[0:1000]` returns the rows labeled 0...1000

Now you try it! What is the difference between:

> strike_df.iloc[0:5]

> strike_df.loc[0:5]

---
### <span style="color:blue"> Exercise:</span>
>
> Select all the cases of `Airport` that had incidents at heights above 20000
>
>* Print a list of these airports. No duplicates!
>
>* How many such airports are there? (no duplicates)
>  
>
> Select all the cases of `Airport` that had incidents at heights above 20000 - select only the 6th on this list
>
---


---
A summary of the functions in this unit:

>* `.iloc` - selects rows and columns by integer location
>* `.loc` - selects rows and columns by label location



Note: plain indexing operators, like the ones used on dictionaries, also work in pandas. But for more advanced operations, it is better to get used to `loc` and `iloc`.

---
