# Tech #5 Selecting, Filtering, and Creating Data

- This file illustrates how to select, filter, and create data.

**Import pandas package**

In [1]:
import pandas as pd

**Load the dataset**

In [2]:
df19 = pd.read_csv('Compustat_fy2019.csv', parse_dates = ['datadate'])


**Navigate the dataset**

In [3]:
df19.head()

Unnamed: 0,tic,conm,datadate,fyear,at,lt,teq,revt,ni,exchg
0,AIR,AAR CORP,2020-05-31,2019,2079.0,1176.4,902.6,2089.3,4.4,11
1,AAL,AMERICAN AIRLINES GROUP INC,2019-12-31,2019,59995.0,60113.0,-118.0,45768.0,1686.0,14
2,CECE,CECO ENVIRONMENTAL CORP,2019-12-31,2019,408.637,215.62,193.017,341.869,17.707,14
3,ASA,ASA GOLD AND PRECIOUS METALS,2019-11-30,2019,286.612,0.733,285.879,2.371,91.431,11
4,PNW,PINNACLE WEST CAPITAL CORP,2019-12-31,2019,18479.247,12926.059,5553.188,3471.209,538.32,11


---
## Selecting Data

### Selecting one variable (column)
```Python
df19['at']
```
- Put the variable name, in quotes, inside the square brackets.

In [18]:
df19['at']

0        2079.000
1       59995.000
2         408.637
3         286.612
4       18479.247
          ...    
4722        9.700
4723       30.421
4724       65.955
4725       25.451
4726      201.909
Name: at, Length: 4727, dtype: float64

### Selecting multiple variables (columns)
```Python
df19[['conm','at']]
```
- Put a list of variable names (itself written with square brackets) inside the outer square brackets, hence the double brackets.

In [19]:
df19[['conm','at']]

Unnamed: 0,conm,at
0,AAR CORP,2079.000
1,AMERICAN AIRLINES GROUP INC,59995.000
2,CECO ENVIRONMENTAL CORP,408.637
3,ASA GOLD AND PRECIOUS METALS,286.612
4,PINNACLE WEST CAPITAL CORP,18479.247
...,...,...
4722,RENALYTIX AI PLC,9.700
4723,CASTOR MARITIME INC,30.421
4724,IMMUNIC INC,65.955
4725,ARMATA PHARMACEUTICALS INC,25.451
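
The difference between single and double brackets can be sketched on a small made-up dataframe. The name `toy` and its values are illustrative assumptions, not Compustat data; only the column names mirror the file used above.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    'conm': ['FIRM A', 'FIRM B', 'FIRM C'],
    'at': [100.0, 250.5, 75.25],
    'ni': [10.0, -5.0, 7.5],
})

one_col = toy['at']             # one name in single brackets -> Series
two_cols = toy[['conm', 'at']]  # a list of names -> DataFrame
one_col_df = toy[['at']]        # a one-item list still gives a DataFrame

print(type(one_col).__name__)     # Series
print(type(two_cols).__name__)    # DataFrame
print(type(one_col_df).__name__)  # DataFrame
```

This is why the single-column output above ends with a `Name: at, ... dtype: float64` line, while the two-column output is displayed as a table.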


**Note: These expressions only display a result. They do not change the `df19` dataframe at all.**
The result is shown and then discarded because it is not assigned to a name. If you want to keep what you select as a new dataframe, assign it to a new name using the code below:

### Subsetting your dataframe and saving that as a new dataframe
```Python
df19_1 = df19[['conm','at']].copy(deep=True)
```
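Pick a name that differs from the original; one convention is to append a number and increase it by one for each new subset (`df19_1`, `df19_2`, ...). If you assign the result to the original name, you overwrite your initial dataframe. Calling `.copy(deep=True)` makes the new dataframe independent of `df19`, so later changes to one do not affect the other.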

In [21]:
df19_1 = df19[['conm','at']].copy(deep=True)
df19_1

Unnamed: 0,conm,at
0,AAR CORP,2079.000
1,AMERICAN AIRLINES GROUP INC,59995.000
2,CECO ENVIRONMENTAL CORP,408.637
3,ASA GOLD AND PRECIOUS METALS,286.612
4,PINNACLE WEST CAPITAL CORP,18479.247
...,...,...
4722,RENALYTIX AI PLC,9.700
4723,CASTOR MARITIME INC,30.421
4724,IMMUNIC INC,65.955
4725,ARMATA PHARMACEUTICALS INC,25.451


In [11]:
df19_2 = df19[['conm','at']].copy(deep=True)
df19_2

Unnamed: 0,conm,at
0,AAR CORP,2079.000
1,AMERICAN AIRLINES GROUP INC,59995.000
2,CECO ENVIRONMENTAL CORP,408.637
3,ASA GOLD AND PRECIOUS METALS,286.612
4,PINNACLE WEST CAPITAL CORP,18479.247
...,...,...
4722,RENALYTIX AI PLC,9.700
4723,CASTOR MARITIME INC,30.421
4724,IMMUNIC INC,65.955
4725,ARMATA PHARMACEUTICALS INC,25.451
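
What `.copy(deep=True)` buys you can be sketched as follows, again on a made-up dataframe (`toy` and `toy_1` are hypothetical names used only for this illustration): changing the saved subset leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({'conm': ['FIRM A', 'FIRM B'], 'at': [100.0, 250.5]})

toy_1 = toy[['conm', 'at']].copy(deep=True)  # independent copy of the subset
toy_1.loc[0, 'at'] = 999.0                   # change a value in the copy only

print(toy.loc[0, 'at'])    # 100.0 (the original is unchanged)
print(toy_1.loc[0, 'at'])  # 999.0
```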


### Selecting observations (rows)
```Python
df19[0:10]
```
- Selects the first through the tenth observation. Slicing uses the row position, which matches the INDEX here because the default index starts at 0.
- `df[a:b]` gives you the observations at positions a through **b-1**; the row at position b is not included.
- If you omit `a`, the slice starts at position 0; if you omit `b`, it runs through the last row. So `df[0:10]` is equivalent to `df[:10]`.

In [12]:
df19[0:10]

Unnamed: 0,tic,conm,datadate,fyear,at,lt,teq,revt,ni,exchg
0,AIR,AAR CORP,2020-05-31,2019,2079.0,1176.4,902.6,2089.3,4.4,11
1,AAL,AMERICAN AIRLINES GROUP INC,2019-12-31,2019,59995.0,60113.0,-118.0,45768.0,1686.0,14
2,CECE,CECO ENVIRONMENTAL CORP,2019-12-31,2019,408.637,215.62,193.017,341.869,17.707,14
3,ASA,ASA GOLD AND PRECIOUS METALS,2019-11-30,2019,286.612,0.733,285.879,2.371,91.431,11
4,PNW,PINNACLE WEST CAPITAL CORP,2019-12-31,2019,18479.247,12926.059,5553.188,3471.209,538.32,11
5,AAN,AARON'S INC,2019-12-31,2019,3297.8,1560.541,1737.259,3947.656,31.472,11
6,ABT,ABBOTT LABORATORIES,2019-12-31,2019,67887.0,36586.0,31301.0,31904.0,3687.0,11
7,ACU,ACME UNITED CORP,2019-12-31,2019,110.749,55.044,55.705,142.457,5.514,12
8,BKTI,BK TECHNOLOGIES CORP,2019-12-31,2019,37.94,14.664,23.276,40.1,-2.636,12
9,AE,ADAMS RESOURCES & ENERGY INC,2019-12-31,2019,330.842,179.201,151.641,1811.247,8.207,12
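
The slicing rules can be checked on a five-row made-up dataframe (`toy` is a hypothetical name, and the values are not Compustat data). The last two lines also show that the slice counts positions, not index labels, which matters once rows have been filtered out.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({'at': [10.0, 20.0, 30.0, 40.0, 50.0]})

first_three = toy[0:3]  # positions 0, 1, 2 (position 3 is excluded)
same_thing = toy[:3]    # an omitted start defaults to 0
from_three = toy[3:]    # an omitted end runs through the last row

print(first_three['at'].tolist())      # [10.0, 20.0, 30.0]
print(first_three.equals(same_thing))  # True
print(from_three['at'].tolist())       # [40.0, 50.0]

big = toy[toy['at'] > 20]       # keeps the rows labeled 2, 3, 4
print(big[0:2].index.tolist())  # [2, 3] (first two rows by position)
```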


### Selecting both observations and variables at the same time
```Python
df19[0:10][['conm','at']]
```

In [13]:
df19[0:10][['conm','at']]

Unnamed: 0,conm,at
0,AAR CORP,2079.0
1,AMERICAN AIRLINES GROUP INC,59995.0
2,CECO ENVIRONMENTAL CORP,408.637
3,ASA GOLD AND PRECIOUS METALS,286.612
4,PINNACLE WEST CAPITAL CORP,18479.247
5,AARON'S INC,3297.8
6,ABBOTT LABORATORIES,67887.0
7,ACME UNITED CORP,110.749
8,BK TECHNOLOGIES CORP,37.94
9,ADAMS RESOURCES & ENERGY INC,330.842
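
The chained form above first slices the rows and then picks the columns. As an alternative that this file does not otherwise use, pandas also offers `.loc` (label-based) and `.iloc` (position-based), which select rows and columns in one step. A minimal sketch on made-up data (`toy` is a hypothetical name):

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    'conm': ['FIRM A', 'FIRM B', 'FIRM C', 'FIRM D'],
    'at': [100.0, 250.5, 75.25, 30.0],
    'ni': [10.0, -5.0, 7.5, 1.0],
})

chained = toy[0:2][['conm', 'at']]       # two steps, as in the example above
by_label = toy.loc[0:1, ['conm', 'at']]  # labels 0 through 1; the end label IS included
by_position = toy.iloc[0:2, [0, 1]]      # positions 0 and 1; the end position is excluded

print(chained.equals(by_label))     # True
print(chained.equals(by_position))  # True
```

Note the difference in the end point: `.loc[0:1]` includes the row labeled 1, whereas `[0:2]` and `.iloc[0:2]` stop before position 2.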


---
## Filtering Data

### Filtering data on one condition
```Python
df19[df19['exchg']==14]
```
- Selects companies that are listed on NASDAQ by putting the corresponding condition (i.e., `df19['exchg']==14`) in the brackets.
- Python comparison operators
    - Equal (==)
    - Not equal (!=)
    - Greater (>)
    - Greater or equal (>=)
    - Smaller (<)
    - Smaller or equal (<=)

In [14]:
df19[df19['exchg']==14]

Unnamed: 0,tic,conm,datadate,fyear,at,lt,teq,revt,ni,exchg
1,AAL,AMERICAN AIRLINES GROUP INC,2019-12-31,2019,59995.000,60113.000,-118.000,45768.000,1686.000,14
2,CECE,CECO ENVIRONMENTAL CORP,2019-12-31,2019,408.637,215.620,193.017,341.869,17.707,14
10,AMD,ADVANCED MICRO DEVICES,2019-12-31,2019,6028.000,3201.000,2827.000,6731.000,341.000,14
13,AIRT,AIR T INC,2020-03-31,2019,151.427,120.336,25.011,236.785,7.656,14
15,ATRI,ATRION CORP,2019-12-31,2019,262.031,24.161,237.870,155.066,36.761,14
...,...,...,...,...,...,...,...,...,...,...
4719,CALT,CALLIDITAS THERAPE,2019-12-31,2019,90.473,6.115,84.358,19.785,-3.487,14
4720,NMCI,NAVIOS MARITIME CONTAINERS,2019-12-31,2019,460.302,270.322,189.980,141.532,7.507,14
4722,RNLX,RENALYTIX AI PLC,2019-06-30,2019,9.700,1.149,8.551,0.000,-42.301,14
4723,CTRM,CASTOR MARITIME INC,2019-12-31,2019,30.421,17.217,13.204,5.968,1.088,14
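
To see what happens inside the brackets, it can help to look at the condition on its own: it evaluates to a Series of True/False values, one per row, and the dataframe keeps only the rows marked True. A small sketch on made-up data (`toy`, `mask`, and the ticker values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    'tic': ['AAA', 'BBB', 'CCC', 'DDD'],
    'exchg': [11, 14, 14, 12],
})

mask = toy['exchg'] == 14  # boolean Series: one True/False per row
print(mask.tolist())       # [False, True, True, False]

nasdaq = toy[mask]             # keeps only the rows where the mask is True
print(nasdaq['tic'].tolist())  # ['BBB', 'CCC']
print(nasdaq.index.tolist())   # [1, 2] (the original index labels are kept)

not_nasdaq = toy[toy['exchg'] != 14]
print(not_nasdaq['tic'].tolist())  # ['AAA', 'DDD']
```

Keeping the original index labels is why the filtered output above jumps from row 2 to row 10 instead of being renumbered.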


### Reminder: to save your filtered dataframe as a new, separate dataframe, use:
```Python
df19_NASDAQ = df19[df19['exchg']==14].copy(deep=True)
```

In [22]:
df19_NASDAQ = df19[df19['exchg']==14].copy(deep=True)
df19_NASDAQ

Unnamed: 0,tic,conm,datadate,fyear,at,lt,teq,revt,ni,exchg
1,AAL,AMERICAN AIRLINES GROUP INC,2019-12-31,2019,59995.000,60113.000,-118.000,45768.000,1686.000,14
2,CECE,CECO ENVIRONMENTAL CORP,2019-12-31,2019,408.637,215.620,193.017,341.869,17.707,14
10,AMD,ADVANCED MICRO DEVICES,2019-12-31,2019,6028.000,3201.000,2827.000,6731.000,341.000,14
13,AIRT,AIR T INC,2020-03-31,2019,151.427,120.336,25.011,236.785,7.656,14
15,ATRI,ATRION CORP,2019-12-31,2019,262.031,24.161,237.870,155.066,36.761,14
...,...,...,...,...,...,...,...,...,...,...
4719,CALT,CALLIDITAS THERAPE,2019-12-31,2019,90.473,6.115,84.358,19.785,-3.487,14
4720,NMCI,NAVIOS MARITIME CONTAINERS,2019-12-31,2019,460.302,270.322,189.980,141.532,7.507,14
4722,RNLX,RENALYTIX AI PLC,2019-06-30,2019,9.700,1.149,8.551,0.000,-42.301,14
4723,CTRM,CASTOR MARITIME INC,2019-12-31,2019,30.421,17.217,13.204,5.968,1.088,14


### Filtering data on multiple conditions
```Python
df19[(df19['exchg']==14) & (df19['at']>=5000)]
```
- Selects companies that are listed on NASDAQ and have at least $5 billion in total assets (`at` is reported in millions of dollars).
- Put parentheses around each condition, then connect the conditions with bitwise operators. The parentheses are needed because `&` and `|` bind more tightly than comparison operators such as `==` and `>=`.
- Python bitwise operators
    - And (&)
    - Or ( | )

- You can also store the condition in a variable first and then use that variable inside the brackets:
```Python
condition = (df19['exchg']==14) & (df19['at']>=5000)
df19[condition]
```
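
The sketch below combines conditions on a made-up dataframe (`toy` and all of its values are hypothetical). In addition to `&` and `|` from the list above, it uses `~`, the pandas way to negate a condition, which is an addition not covered elsewhere in this file.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    'tic': ['AAA', 'BBB', 'CCC', 'DDD'],
    'exchg': [11, 14, 14, 12],
    'at': [8000.0, 6000.0, 300.0, 50.0],
})

on_nasdaq = toy['exchg'] == 14  # each condition stored under its own name
large = toy['at'] >= 5000

both = toy[on_nasdaq & large]    # AND: both conditions hold
either = toy[on_nasdaq | large]  # OR: at least one condition holds
off_nasdaq = toy[~on_nasdaq]     # NOT: the condition does not hold

print(both['tic'].tolist())        # ['BBB']
print(either['tic'].tolist())      # ['AAA', 'BBB', 'CCC']
print(off_nasdaq['tic'].tolist())  # ['AAA', 'DDD']
```

Naming each condition separately also sidesteps the parentheses issue, since every comparison is evaluated before `&` or `|` is applied.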


---
## Creating Data

### Creating a new variable
```Python
df19['roa'] = df19['ni'] / df19['at']
```
- Creates a new variable called `roa` (return on assets) by dividing net income by total assets.
- Reminder: if you have filtered or subsetted your dataframe, use the new dataframe's name everywhere in this expression, on both sides of the `=`.
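
The reminder above can be sketched as follows, using a made-up dataframe (`toy`, `toy_nasdaq`, and the values are hypothetical). The new column is created on the filtered copy, so the original dataframe does not gain a `roa` column.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    'tic': ['AAA', 'BBB', 'CCC'],
    'exchg': [11, 14, 14],
    'ni': [10.0, -5.0, 30.0],
    'at': [100.0, 250.0, 300.0],
})

# Filter first and save an independent copy under a new name.
toy_nasdaq = toy[toy['exchg'] == 14].copy(deep=True)

# Use the NEW name on both sides of the assignment.
toy_nasdaq['roa'] = toy_nasdaq['ni'] / toy_nasdaq['at']

print(toy_nasdaq['roa'].tolist())  # [-0.02, 0.1]
print('roa' in toy.columns)        # False (the original is unchanged)
```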