# Supervised Classification Module (exercise)

**Lecturer:** Ashish Mahabal<br>
**Jupyter Notebook Authors:** Ashish Mahabal & Yuhan Yao

This Jupyter notebook lesson extends material from the GROWTH Summer School 2019 and the NARIT-EACOA 2019 summer workshop, and was adapted for the La Serena School on Data Science 2021.

## Objective
Classify objects into different classes using (a) decision trees and (b) random forests.

## Key steps
- Pick variable types
- Select training sample
- Select method
- Look at confusion matrix and details 

## Required dependencies

See the GROWTH school webpage for detailed instructions on how to install these modules and packages. Nominally, you should be able to install the Python modules with `pip install <module>`. The external astromatic packages are most easily installed using package managers (e.g., `rpm`, `apt-get`).

### Python modules
* python 3
* astropy
* numpy
* astroquery
* pandas
* matplotlib
* pydotplus
* IPython.display
* scikit-learn (imported as `sklearn`)

### External packages
None

### Partial Credits
Pavlos Protopapas (LSSDS notebook)

## Now you will do a similar set of analyses on a different dataset.
### Here you will use the light curves file to derive features
### Then use the resulting file to run decision trees and random forests for classification

#### import the required modules (exercise)

In [1]:
import numpy as np
import pandas as pd
#import ...
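
A possible set of imports for the rest of this exercise is sketched below. It assumes the shape features come from `scipy.stats` and the classifiers and metrics from scikit-learn; your own solution may need more or fewer.

```python
import numpy as np
import pandas as pd

# Per-object shape features (assumed choice; used later in this notebook)
from scipy.stats import skew, kurtosis

# Classifiers and evaluation helpers from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
```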

#### read the lightcurves file

In [2]:
datadir = 'data'
lightcurves = datadir + '/CRTS_6varclasses.csv.gz'

In [3]:
lcs = pd.read_csv(lightcurves,
                 compression='gzip',
                 header=1,
                 sep=',',
                 skipinitialspace=True,
                 nrows=100000)

lcs.columns = ['ID', 'MJD', 'Mag', 'magerr', 'RA', 'Dec']
lcs.head()

Unnamed: 0,ID,MJD,Mag,magerr,RA,Dec
0,1109065026725,53705.501925,16.943797,0.082004,182.25871,9.7658
1,1109065026725,53731.483314,16.645102,0.075203,182.25867,9.76585
2,1109065026725,53731.491406,16.693791,0.076497,182.2587,9.76574
3,1109065026725,53731.499465,16.793651,0.078755,182.25869,9.76576
4,1109065026725,53731.507529,16.767817,0.077436,182.25878,9.76581


In [4]:
len(lcs)

100000
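
Note that the rows counted above are individual photometric points, not objects. A quick way to see how many distinct light curves they cover is to group by `ID`. The small frame `toy_lcs` below is a made-up stand-in with the same column names, so the snippet runs without the data file; with the real data you would use `lcs` instead.

```python
import pandas as pd

# Hypothetical stand-in for `lcs` (same column names, invented values)
toy_lcs = pd.DataFrame({
    'ID':  [101, 101, 101, 202, 202],
    'MJD': [53705.5, 53731.4, 53731.5, 53800.1, 53801.2],
    'Mag': [16.94, 16.65, 16.69, 14.99, 14.98],
})

n_objects = toy_lcs['ID'].nunique()               # number of distinct light curves
points_per_object = toy_lcs.groupby('ID').size()  # observations per object

print(n_objects)
print(points_per_object)
```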

#### We need classes, so load the catalog file too

In [5]:
catalog = datadir + '/CatalinaVars.tbl.gz'

In [6]:
cat = pd.read_csv(catalog,
                 compression='gzip',
                 header=5,
                 sep=' ',
                 skipinitialspace=True,
                 )

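# The parsed column names are shifted by one: take the names from the second
# column onward, drop the last column, and relabel the remaining ones
columns = cat.columns[1:]
cat = cat[cat.columns[:-1]]
cat.columns = columns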

cat.head()

Unnamed: 0,Catalina_Surveys_ID,Numerical_ID,RA_J2000,Dec,V_mag,Period_days,Amplitude,Number_Obs,Var_Type
0,CSS_J000020.4+103118,1109001041232,00:00:20.41,+10:31:18.9,14.62,1.491758,2.39,223,2
1,CSS_J000031.5-084652,1009001044997,00:00:31.50,-08:46:52.3,14.14,0.404185,0.12,163,1
2,CSS_J000036.9+412805,1140001063366,00:00:36.94,+41:28:05.7,17.39,0.274627,0.73,158,1
3,CSS_J000037.5+390308,1138001069849,00:00:37.55,+39:03:08.1,17.74,0.30691,0.23,219,1
4,CSS_J000103.3+105724,1109001050739,00:01:03.37,+10:57:24.4,15.25,1.5837582,0.11,223,8


In [7]:
RRd = cat[ cat['Var_Type'].isin([6]) & (cat['Number_Obs']>100) ]

In [8]:
RRd.head()

Unnamed: 0,Catalina_Surveys_ID,Numerical_ID,RA_J2000,Dec,V_mag,Period_days,Amplitude,Number_Obs,Var_Type
115,CSS_J001420.8+031214,1104002007409,00:14:20.84,+03:12:14.0,17.45,0.38711,0.56,174,6
198,CSS_J001724.9+200542,1121002007726,00:17:24.90,+20:05:42.2,16.64,0.3571291,0.39,224,6
214,CSS_J001812.9+210201,1121002027610,00:18:12.97,+21:02:01.5,14.54,0.41616,0.34,224,6
531,CSS_J003001.7+094947,1109003028079,00:30:01.71,+09:49:47.6,16.91,0.3729404,0.36,212,6
640,CSS_J003359.4+022609,1101004049971,00:33:59.48,+02:26:09.0,15.87,0.3601025,0.27,195,6


### Get numerical ids of objects belonging to the RRd class - call them RRds

In [9]:
RRds = RRd['Numerical_ID']

In [10]:
RRds.head()

115    1104002007409
198    1121002007726
214    1121002027610
531    1109003028079
640    1101004049971
Name: Numerical_ID, dtype: int64
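
The same selection pattern extends to more than one class, which you will need later when picking a couple of classes to classify. The sketch below uses a made-up stand-in `toy_cat` with three of the catalog's column names; the class codes `[1, 6]` are an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `cat` (invented rows, real column names)
toy_cat = pd.DataFrame({
    'Numerical_ID': [11, 12, 13, 14, 15],
    'Number_Obs':   [174, 90, 224, 163, 158],
    'Var_Type':     [6, 6, 1, 1, 2],
})

wanted = [1, 6]  # class codes to keep (arbitrary choice for this sketch)
subset = toy_cat[toy_cat['Var_Type'].isin(wanted) & (toy_cat['Number_Obs'] > 100)]

# Objects per class, and a lookup from numerical ID to class code
print(subset['Var_Type'].value_counts())
id_to_class = subset.set_index('Numerical_ID')['Var_Type'].to_dict()
print(id_to_class)
```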

### Let us extract some features from the mags (let's ignore the mag errors for now)

#### For a given ID you could do it as follows. `isin()` accepts a list, so you could pass the entire RRds there
#### You will lose the ID info if you do that in a single step, so you could break it up per object

In [11]:
lcs[lcs['ID'].isin([1109065026725])]['Mag']  # pass the ID as an integer to match the numeric ID column

0      16.943797
1      16.645102
2      16.693791
3      16.793651
4      16.767817
5      16.885437
6      16.845561
7      16.888531
8      16.941978
9      16.822148
10     16.847925
11     16.878077
12     16.879755
13     16.889322
14     16.943497
15     16.883803
16     16.900262
17     16.742775
18     16.877542
19     16.952343
20     16.906344
21     16.713005
22     16.842459
23     16.895991
24     16.738557
25     16.860171
26     16.881559
27     16.753773
28     16.930406
29     16.613862
         ...    
369    16.476747
370    16.497292
371    16.866625
372    16.930688
373    16.877139
374    16.933141
375    16.885875
376    16.729198
377    16.851957
378    16.861757
379    16.702795
380    16.797014
381    16.781394
382    16.806467
383    16.517148
384    16.556145
385    16.533153
386    16.561263
387    16.462612
388    16.483990
389    16.489717
390    16.485392
391    16.487658
392    16.490796
393    16.474579
394    16.461535
395    16.891826
396    16.8245

In [12]:
lcs[lcs['ID'].isin(RRds)]['Mag']

1212     14.998325
1213     14.984769
1214     15.010683
1215     14.984963
1216     14.817003
1217     14.818442
1218     14.824113
1219     14.816897
1220     14.828202
1221     14.826791
1222     14.808851
1223     14.838217
1224     14.800337
1225     14.813361
1226     14.785231
1227     14.799678
1228     14.925248
1229     14.925225
1230     14.914077
1231     14.905855
1232     14.830322
1233     14.843997
1234     14.837703
1235     14.818362
1236     14.961875
1237     14.988142
1238     14.979043
1239     14.979344
1240     14.869105
1241     14.854159
           ...    
98414    15.953603
98415    15.881508
98416    16.260605
98417    16.210558
98418    16.280644
98419    16.316527
98420    16.273537
98421    16.271487
98422    16.217838
98423    16.203245
98424    15.981133
98425    15.970665
98426    16.038118
98427    16.001187
98428    16.236077
98429    16.322878
98430    16.288100
98431    16.317787
98432    16.290396
98433    16.325196
98434    16.305619
98435    16.
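
To keep track of which magnitudes belong to which object, one option is to filter with `isin()` first and then group by `ID`. This is only a sketch: `toy_lcs` and `toy_ids` are invented stand-ins for `lcs` and `RRds`.

```python
import pandas as pd

# Hypothetical stand-ins for `lcs` and `RRds`
toy_lcs = pd.DataFrame({
    'ID':  [101, 101, 202, 202, 303],
    'Mag': [16.9, 16.6, 15.0, 14.8, 17.2],
})
toy_ids = [101, 202]

# Step 1: keep only the rows of the wanted objects
selected = toy_lcs[toy_lcs['ID'].isin(toy_ids)]

# Step 2: split by object, so each ID maps to its own array of magnitudes
mags_by_id = {obj_id: grp['Mag'].values for obj_id, grp in selected.groupby('ID')}

print(sorted(mags_by_id))
```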

#### Let's assign mags for '1109065026725' to mags (a dictionary)

In [13]:
mags = {}
mags['1109065026725'] = lcs[lcs['ID'].isin([1109065026725])]['Mag']  # integer ID in the lookup, string key in the dictionary

#### Let us get the mean of mags for this one particular object: '1109065026725'

In [14]:
np.mean(mags['1109065026725'].values)

16.717012834586466

#### Assign it to another dictionary with the same key

In [15]:
means = {}
means['1109065026725'] = np.mean(mags['1109065026725'].values)

## Exercise!

### Get mean, median, skew, kurtosis for all ids in our light curves set

In [16]:
from scipy.stats import skew, kurtosis
skew(mags['1109065026725'])
kurtosis(mags['1109065026725'])
np.median(mags['1109065026725'])

16.729198

#### Define the dictionaries that you need

In [17]:
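# One dictionary per feature, keyed by object ID; replace each ... with your code
means, medians, skews, kurts = {}, {}, {}, {}
for obj_id in lcs['ID'].unique():
    means[obj_id] = ...
    medians[obj_id] = ...
    skews[obj_id] = ...
    kurts[obj_id] = ...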
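
One possible way to fill in the loop above is to iterate over `groupby('ID')`, which hands you each object's rows together with its ID. The sketch below runs on an invented `toy_lcs`; swap in `lcs` for the real data. Note that `scipy.stats.kurtosis` returns the excess (Fisher) kurtosis by default.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

# Hypothetical stand-in for `lcs` (invented magnitudes for two objects)
toy_lcs = pd.DataFrame({
    'ID':  [101] * 5 + [202] * 5,
    'Mag': [16.9, 16.6, 16.7, 16.8, 16.5, 15.0, 14.8, 14.9, 15.3, 14.7],
})

rows = []
for obj_id, grp in toy_lcs.groupby('ID'):
    m = grp['Mag'].to_numpy()          # magnitudes of this one object
    rows.append({
        'ID': obj_id,
        'mean': np.mean(m),
        'median': np.median(m),
        'skew': skew(m),
        'kurtosis': kurtosis(m),       # excess kurtosis (scipy default)
    })

feats = pd.DataFrame(rows)             # one row of features per object
print(feats)
```

Collecting the results as a list of records rather than four separate dictionaries makes the later step of writing a table more direct, but the dictionary approach works just as well.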

### Now create a CSV file with the following columns
### ID, mean, median, skew, kurtosis, Class
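
A minimal sketch of this step, assuming the class label is taken from the catalog's `Var_Type` column by matching `ID` against `Numerical_ID`. The frames `toy_feats` and `toy_cat` are invented stand-ins, and the output goes to an in-memory buffer here; pass a filename of your choice (for example a hypothetical `'features.csv'`) to `to_csv` to write to disk.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the feature table and the catalog
toy_feats = pd.DataFrame({
    'ID': [11, 13],
    'mean': [16.7, 14.9],
    'median': [16.7, 14.9],
    'skew': [0.0, 0.6],
    'kurtosis': [-1.3, -0.9],
})
toy_cat = pd.DataFrame({'Numerical_ID': [11, 13, 14], 'Var_Type': [6, 1, 1]})

# Attach the class to each object, then tidy up the column names
table = toy_feats.merge(toy_cat, left_on='ID', right_on='Numerical_ID', how='inner')
table = table.drop(columns='Numerical_ID').rename(columns={'Var_Type': 'Class'})

buf = io.StringIO()                 # in-memory target instead of a file
table.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)
```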

## Now run a decision tree and a random forest on these variables for a couple of classes of your choice
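
The outline below shows one way this last step could look with scikit-learn. Everything in `toy` is synthetic (random numbers standing in for the feature CSV, with two arbitrary class codes), and the hyperparameters (`max_depth=3`, `n_estimators=100`, a 30% test split) are illustrative assumptions rather than recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the feature table (invented numbers, two class codes)
rng = np.random.default_rng(0)
n = 200  # objects per class
toy = pd.DataFrame({
    'mean':     np.concatenate([rng.normal(15.0, 0.5, n), rng.normal(17.0, 0.5, n)]),
    'median':   np.concatenate([rng.normal(15.0, 0.5, n), rng.normal(17.0, 0.5, n)]),
    'skew':     np.concatenate([rng.normal(-0.5, 0.3, n), rng.normal(0.2, 0.3, n)]),
    'kurtosis': rng.normal(0.0, 1.0, 2 * n),
    'Class':    np.repeat([1, 6], n),
})

X = toy[['mean', 'median', 'skew', 'kurtosis']]
y = toy['Class']

# Hold out part of the sample so the confusion matrix reflects unseen objects
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    'decision tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

cms = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Rows are true classes, columns are predicted classes, in the order of `labels`
    cms[name] = confusion_matrix(y_test, model.predict(X_test), labels=[1, 6])
    print(name)
    print(cms[name])
```

Comparing the two confusion matrices side by side is the "look at confusion matrix and details" step from the key steps at the top of this notebook.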