# Tech #5 Selecting, Filtering, and Creating Data

- This file illustrates how to select, filter, and create data.

**Import pandas package**

In [1]:
import pandas as pd

**Load the dataset**

In [2]:
df19 = pd.read_csv('Compustat_fy2019.csv', parse_dates = ['datadate'])


**Navigate the dataset**

In [3]:
df19.head()

|  | tic | conm | datadate | fyear | at | lt | teq | revt | ni | exchg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AIR | AAR CORP | 2020-05-31 | 2019 | 2079.0 | 1176.4 | 902.6 | 2089.3 | 4.4 | 11 |
| 1 | AAL | AMERICAN AIRLINES GROUP INC | 2019-12-31 | 2019 | 59995.0 | 60113.0 | -118.0 | 45768.0 | 1686.0 | 14 |
| 2 | CECE | CECO ENVIRONMENTAL CORP | 2019-12-31 | 2019 | 408.637 | 215.62 | 193.017 | 341.869 | 17.707 | 14 |
| 3 | ASA | ASA GOLD AND PRECIOUS METALS | 2019-11-30 | 2019 | 286.612 | 0.733 | 285.879 | 2.371 | 91.431 | 11 |
| 4 | PNW | PINNACLE WEST CAPITAL CORP | 2019-12-31 | 2019 | 18479.247 | 12926.059 | 5553.188 | 3471.209 | 538.32 | 11 |
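A few other ways to inspect a freshly loaded dataframe are sketched below. Because the CSV file itself is not reproduced here, the sketch reads a small in-memory stand-in (`csv_text` and `toy` are illustrative names, built from the first three rows of the output above with fewer columns).

```python
import io
import pandas as pd

# Illustrative stand-in for the CSV file: three rows taken from df19.head()
csv_text = """tic,conm,datadate,fyear,at,exchg
AIR,AAR CORP,2020-05-31,2019,2079.0,11
AAL,AMERICAN AIRLINES GROUP INC,2019-12-31,2019,59995.0,14
CECE,CECO ENVIRONMENTAL CORP,2019-12-31,2019,408.637,14
"""

# parse_dates converts the listed columns from text to datetime values
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['datadate'])

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # variable names
print(toy.dtypes)            # data type of each variable
```

The same three calls should work on `df19` once the real file is loaded.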


---
## Selecting Data

### Selecting one variable (column)
```Python
df19['at']
```
- Specify the variable name with quotes around it in the brackets.

In [5]:
df19['lt']

0        1176.400
1       60113.000
2         215.620
3           0.733
4       12926.059
          ...    
4722        1.149
4723       17.217
4724        7.592
4725       10.858
4726      149.781
Name: lt, Length: 4727, dtype: float64

### Selecting multiple variables (columns)
```Python
df19[['conm','at']]
```
- Put a list of variable names inside the selection brackets. The list has its own brackets, which is why you see double brackets `[[ ]]`.

In [7]:
df19[['conm','lt']]

|  | conm | lt |
|---|---|---|
| 0 | AAR CORP | 1176.400 |
| 1 | AMERICAN AIRLINES GROUP INC | 60113.000 |
| 2 | CECO ENVIRONMENTAL CORP | 215.620 |
| 3 | ASA GOLD AND PRECIOUS METALS | 0.733 |
| 4 | PINNACLE WEST CAPITAL CORP | 12926.059 |
| ... | ... | ... |
| 4722 | RENALYTIX AI PLC | 1.149 |
| 4723 | CASTOR MARITIME INC | 17.217 |
| 4724 | IMMUNIC INC | 7.592 |
| 4725 | ARMATA PHARMACEUTICALS INC | 10.858 |
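
The two outputs above look different because single brackets return a pandas Series while double brackets return a DataFrame. A minimal sketch on a two-row toy dataframe (`toy` is an illustrative name; the values come from the first rows of `df19`):

```python
import pandas as pd

# Illustrative toy dataframe with two of the companies shown above
toy = pd.DataFrame({
    'conm': ['AAR CORP', 'AMERICAN AIRLINES GROUP INC'],
    'at': [2079.0, 59995.0],
    'lt': [1176.4, 60113.0],
})

one_col = toy['lt']              # single brackets -> Series
one_col_df = toy[['lt']]         # list with one name -> DataFrame
two_cols = toy[['conm', 'lt']]   # list with two names -> DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(two_cols.columns.tolist())
```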


**Note: These statements only display a result. They do not change the df19 dataset at all.**
The result is shown and Python immediately forgets it. If you want to keep what you select as a new dataframe, assign it to a new name using the code below:

### Subsetting your dataframe and saving that as a new dataframe
```Python
df19_1 = df19[['conm','at']].copy(deep=True)
```
Pick a name that differs from the original; one simple convention is to append a number and increase it by one for each new version (`df19_1`, `df19_2`, ...). If you assign the result to the same name, you overwrite your initial dataframe.
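
To see why a separate name plus `.copy()` is useful, the sketch below changes the copy and checks that the original is untouched. The dataframe `toy` and its copy `toy_1` are illustrative names, built from three rows of `df19`.

```python
import pandas as pd

# Illustrative toy dataframe (values from the first three rows of df19)
toy = pd.DataFrame({
    'conm': ['AAR CORP', 'AMERICAN AIRLINES GROUP INC', 'CECO ENVIRONMENTAL CORP'],
    'at': [2079.0, 59995.0, 408.637],
    'lt': [1176.4, 60113.0, 215.62],
})

# Subset two columns and store an independent copy under a new name
toy_1 = toy[['conm', 'at']].copy(deep=True)

# Rescale 'at' in the copy only
toy_1['at'] = toy_1['at'] * 1000

print(toy['at'].tolist())      # original values are unchanged
print(toy.columns.tolist())    # original still has all three columns
print(toy_1.columns.tolist())  # the copy has only the selected columns
```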

### Selecting observations (rows)
```Python
df19[0:10]
```
- Selecting from the first observation to the tenth observation. The numbers are row positions, which start at 0; they match the INDEX here because df19 uses the default index 0, 1, 2, ...
- `df[a:b]` will give you the observations from position a through **b-1** (the row at position b is not included).
- If you leave out `a`, the slice starts at the first row; if you leave out `b`, it runs through the last row. So `df[0:10]` is equivalent to `df[:10]`.
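
The slicing rules above can be checked on a small example. This is a sketch on a five-row toy dataframe (`toy` is an illustrative name; the tickers are the first five in `df19`):

```python
import pandas as pd

# Illustrative toy dataframe with the first five tickers from df19
toy = pd.DataFrame({'tic': ['AIR', 'AAL', 'CECE', 'ASA', 'PNW']})

first_three = toy[0:3]   # positions 0, 1, 2 (position 3 is excluded)
same = toy[:3]           # omitted start defaults to the first row
rest = toy[3:]           # omitted end runs through the last row

print(first_three['tic'].tolist())
print(rest['tic'].tolist())
print(first_three.equals(same))
```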

### Selecting both observations and variables at the same time
```Python
df19[0:10][['conm','at']]
```
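
As an alternative that this file does not otherwise cover, pandas also offers `.loc`, which takes the rows and the columns in a single pair of brackets. The sketch below assumes the default 0, 1, 2, ... index and uses an illustrative toy dataframe; note that `.loc` works with index labels and includes the end label, unlike the slice above.

```python
import pandas as pd

# Illustrative toy dataframe (values from the first five rows of df19)
toy = pd.DataFrame({
    'conm': ['AAR CORP', 'AMERICAN AIRLINES GROUP INC', 'CECO ENVIRONMENTAL CORP',
             'ASA GOLD AND PRECIOUS METALS', 'PINNACLE WEST CAPITAL CORP'],
    'at': [2079.0, 59995.0, 408.637, 286.612, 18479.247],
    'lt': [1176.4, 60113.0, 215.62, 0.733, 12926.059],
})

# Two steps: slice the rows, then pick the columns
chained = toy[0:3][['conm', 'at']]

# One step with .loc: labels 0 through 2 (end label included) and two columns
with_loc = toy.loc[0:2, ['conm', 'at']]

print(chained.equals(with_loc))
```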

---
## Filtering Data

### Filtering data on one condition
```Python
df19[df19['exchg']==14]
```
- Selecting companies that are listed on NASDAQ by specifying the corresponding condition (i.e., `df19['exchg']==14`) in the brackets.
- Python comparison operators
    - Equal (==)
    - Not equal (!=)
    - Greater (>)
    - Greater or equal (>=)
    - Smaller (<)
    - Smaller or equal (<=)
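
Behind the scenes, a comparison on a column produces one True/False value per row, and the brackets keep only the rows marked True. A minimal sketch on an illustrative toy dataframe (`toy` and `mask` are names chosen for this example; the values come from the first five rows of `df19`):

```python
import pandas as pd

# Illustrative toy dataframe (values from the first five rows of df19)
toy = pd.DataFrame({
    'tic': ['AIR', 'AAL', 'CECE', 'ASA', 'PNW'],
    'exchg': [11, 14, 14, 11, 11],
})

# The comparison returns a Series of True/False values, one per row
mask = toy['exchg'] == 14
print(mask.tolist())

# Rows where the mask is True are kept
nasdaq = toy[mask]
print(nasdaq['tic'].tolist())

# The same pattern works with the other operators, e.g. not equal
not_nasdaq = toy[toy['exchg'] != 14]
print(not_nasdaq['tic'].tolist())
```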

### Reminder: if you want to save your filtered dataframe as a new, separate dataframe, you should use:
```Python
df19_NASDAQ = df19[df19['exchg']==14].copy(deep=True)
```

### Filtering data on multiple conditions
```Python
df19[(df19['exchg']==14) & (df19['at']>=5000)]
```
- Selecting companies that are listed on NASDAQ and have at least 5 billion in total assets (`at` is reported in millions, so 5000 means 5 billion).
- You need to put parentheses around each condition, and then connect these conditions with bitwise operators. 
- Python bitwise operators
    - And (&)
    - Or ( | )

- You can also store the condition in its own variable first and then refer to that variable inside the brackets:
```Python
condition = (df19['exchg']==14) & (df19['at']>=5000)
df19[condition]
```
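
The sketch below applies the "and" and "or" operators to an illustrative toy dataframe so the results can be compared side by side. It also shows `~` (not), which is not listed above but follows the same pattern. The names `toy`, `is_nasdaq`, and `is_large` are chosen for this example only.

```python
import pandas as pd

# Illustrative toy dataframe (values from the first five rows of df19)
toy = pd.DataFrame({
    'tic': ['AIR', 'AAL', 'CECE', 'ASA', 'PNW'],
    'at': [2079.0, 59995.0, 408.637, 286.612, 18479.247],
    'exchg': [11, 14, 14, 11, 11],
})

is_nasdaq = toy['exchg'] == 14
is_large = toy['at'] >= 5000

both = toy[is_nasdaq & is_large]     # NASDAQ and large
either = toy[is_nasdaq | is_large]   # NASDAQ or large (or both)
not_nasdaq = toy[~is_nasdaq]         # everything that is not NASDAQ

print(both['tic'].tolist())
print(either['tic'].tolist())
print(not_nasdaq['tic'].tolist())
```

Storing each condition in its own variable also sidesteps the parentheses issue: in a one-line version, `&` is evaluated before `==` and `>=`, which is why each condition must be wrapped in parentheses there.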


---
## Creating Data

### Creating a new variable
```Python
df19['roa'] = df19['ni'] / df19['at']
```
- Creating a new variable called roa (return on assets) by dividing net income by total assets.
- Reminder: if you have filtered or subsetted your dataframe, you must use the new dataframe's name everywhere in this statement, on both sides of the equals sign.
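
Putting the reminder into practice, the sketch below filters first, saves the result under a new name, and then creates the variable on that new dataframe only. `toy` and `toy_nasdaq` are illustrative names; the values come from the first five rows of `df19`.

```python
import pandas as pd

# Illustrative toy dataframe (values from the first five rows of df19)
toy = pd.DataFrame({
    'tic': ['AIR', 'AAL', 'CECE', 'ASA', 'PNW'],
    'at': [2079.0, 59995.0, 408.637, 286.612, 18479.247],
    'ni': [4.4, 1686.0, 17.707, 91.431, 538.32],
    'exchg': [11, 14, 14, 11, 11],
})

# Filter and save as a new, independent dataframe
toy_nasdaq = toy[toy['exchg'] == 14].copy(deep=True)

# Use the new name on both sides of the equals sign
toy_nasdaq['roa'] = toy_nasdaq['ni'] / toy_nasdaq['at']

print(toy_nasdaq[['tic', 'roa']])
print('roa' in toy.columns)   # the original dataframe has no roa column
```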