PCA & t-SNE of GitHub features, classifying assets both by market cap and by coin function.
Given a dictionary of assets, _assets_, like in Github_analysis.ipynb:
For _feature_ in ['pr_count', 'issues_count', 'commit_add_sum', 'commit_del_sum', 'commit_count', 'star_count', 'return_pcnt', 'volatility_pcnt'] *# SEPARATE NOTEBOOK FOR EACH FEATURE*:
 
   a) create a dataframe, _feature_df_, of the values of _feature_ across all assets
   b) perform PCA on _feature_df_ and plot both the first 2 and the first 3 principal components
   c) perform t-SNE on _feature_df_ using 2 components and 3 components and plot each

Purpose: PCA and t-SNE are dimensionality reduction techniques. They project a high-dimensional dataset onto 2 or 3 dimensions so that it can be plotted and inspected for structure.

In [181]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import glob

## Create a dataframe, _feature_df_, of the values of _feature_ across all assets

In [182]:
feature = 'pr_count'
# Numeric input columns only; 'label' is kept aside for coloring the plots
feat_cols = [feature, 'usd_market_cap']

In [183]:
path = 'github_data'
allFiles = glob.glob(path + "/*.csv")
list_ = []
# Read the same three columns from each asset's CSV, then stack them into one dataframe
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0, usecols=['symbol', feature, 'usd_market_cap'])
    list_.append(df)
feature_df = pd.concat(list_, ignore_index=True)

In [184]:
#arrays with coin functions (our labels)
payment = ['BTC', 'BCH', 'BCD', 'BTG', 'DASH', 'DCR', 'DOGE', 'ETN', 'LTC', 'PIVX', 'XLM', 'XMR', 'XRB', 'XRP', 'ZEC']
utility = ['ADA', 'ARK', 'BTS', 'DGB', 'DRGN', 'EOS', 'FCT', 'GNT', 'LSK', 'NEO', 'OMG', 'QTUM', 'REP', 'RHOC', 'SNT', 'STEEM', 'STRAT', 'WAVES','ZRX']
payment_utility = ['ETH']
asset_utility = ['SC']
unknown = ['BCN', 'XVG', 'ZCL']

In [185]:
# Map every symbol to its coin-function label; symbols in none of the lists fall back to 'unknown'
groups = {'payment': payment, 'utility': utility,
          'payment_utility': payment_utility, 'asset_utility': asset_utility}
label_of = {symbol: label for label, symbols in groups.items() for symbol in symbols}
feature_df['label'] = feature_df['symbol'].map(label_of).fillna('unknown')

In [186]:
print('Size of the dataframe: {}'.format(feature_df.shape))

Size of the dataframe: (8950, 4)


In [187]:
print(feature_df)

     symbol  pr_count usd_market_cap    label
0       ADA         7              -  utility
1       ADA         4              -  utility
2       ADA         5      624651000  utility
3       ADA        11      540946000  utility
4       ADA         7      569134000  utility
5       ADA         9      553774000  utility
6       ADA         0      477417000  utility
7       ADA         0      542638000  utility
8       ADA         6      527469000  utility
9       ADA         6      573302000  utility
10      ADA         6      556914000  utility
11      ADA         3      584358000  utility
12      ADA         8      675684000  utility
13      ADA         1      878314000  utility
14      ADA         3      852317000  utility
15      ADA         5      776227000  utility
16      ADA         7      750659000  utility
17      ADA         4      705718000  utility
18      ADA         8      692370000  utility
19      ADA         3      699574000  utility
20      ADA         3      7986970
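
The printed rows show `-` placeholders in `usd_market_cap`, which suggests pandas has read that column as strings rather than numbers. A minimal sketch of one way to clean it before any numeric analysis follows. The `demo` frame is a hypothetical miniature of `feature_df`, and dropping the incomplete rows is an assumption about how missing market caps should be handled, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical miniature of feature_df; values mirror the first rows printed above
demo = pd.DataFrame({
    'symbol': ['ADA', 'ADA', 'ADA'],
    'pr_count': [7, 4, 5],
    'usd_market_cap': ['-', '-', '624651000'],
})

# errors='coerce' turns anything that is not a number (here '-') into NaN
demo['usd_market_cap'] = pd.to_numeric(demo['usd_market_cap'], errors='coerce')

# PCA and t-SNE cannot handle NaN, so keep only rows with a known market cap
clean = demo.dropna(subset=['usd_market_cap'])
print(clean)
```

Whether to drop, impute, or separately flag the rows without a market cap depends on how much data they represent.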

## Perform PCA on _feature_df_ and plot both the first 2 and the first 3 principal components
PCA exploits the correlation between dimensions to find a small number of new variables (the principal components) that retain as much as possible of the variation in the original data.
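
To make the idea concrete, here is a small sketch on synthetic data (not the GitHub data): three columns, two of which are almost perfectly correlated, reduced to two principal components. The array `X` and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))

# Three columns: the second is nearly a scaled copy of the first, the third is independent
X = np.hstack([
    base,
    2 * base + 0.01 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print(reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)  # share of total variance kept by each component
```

Because two of the three columns carry nearly the same information, two components should retain almost all of the variance.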

In [192]:
from sklearn.decomposition import PCA

#pca = PCA()
pca = PCA(n_components=3)
pca_result = pca.fit_transform(feature_df[feature].values)

ValueError: Expected 2D array, got 1D array instead:
array=[7. 4. 5. ... 0. 0. 1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [193]:
pca = PCA(n_components=3)
pca_result = pca.fit_transform(feature_df[feature].values.reshape(-1,1)) 
print(pca_result)

ValueError: n_components=3 must be between 0 and n_features=1 with svd_solver='full'
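
The error arises because a single column gives PCA only one feature, so at most one component can be extracted. One possible way around this, sketched below under stated assumptions, is to reshape the data so that each asset is one sample and each observation of the feature is one column. The `long_df` frame is hypothetical stand-in data, the `obs` column is introduced here for illustration, and the sketch assumes the rows of each asset are already in time order and that every asset has the same number of observations.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
symbols = ['ADA', 'BTC', 'ETH', 'LTC', 'XMR']

# Hypothetical long-format data shaped like feature_df: 30 observations per asset
long_df = pd.DataFrame({
    'symbol': np.repeat(symbols, 30),
    'pr_count': rng.poisson(5, size=len(symbols) * 30),
})

# Number the observations within each asset (assumes rows are already in time order)
long_df['obs'] = long_df.groupby('symbol').cumcount()

# One row per asset, one column per observation: 5 samples x 30 features
wide = long_df.pivot(index='symbol', columns='obs', values='pr_count')

# With 30 features and 5 samples, 3 components are now allowed
pca = PCA(n_components=3)
components = pca.fit_transform(wide.values)
pca_df = pd.DataFrame(components, index=wide.index,
                      columns=['pca-one', 'pca-two', 'pca-three'])
print(pca_df)
```

With this layout each point in the PCA plot is an asset, which matches the goal of classifying assets by label. Assets with differing history lengths would need truncation or padding first.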

In [194]:
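# Store the components on feature_df. The original cell assigned to df, which only holds
# the last CSV read in the loading loop, and failed with
# "ValueError: Length of values does not match length of index".
# Only as many columns are created as components were actually fitted.
for i, name in enumerate(['pca-one', 'pca-two', 'pca-three'][:pca_result.shape[1]]):
    feature_df[name] = pca_result[:, i]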

In [None]:
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))

In [None]:
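from ggplot import *

# Random row order, so the 3000 plotted points are not all taken from the first few assets
rndperm = np.random.permutation(feature_df.shape[0])

chart = ggplot( feature_df.loc[rndperm[:3000],:], aes(x='pca-one', y='pca-two', color='label') ) \
        + geom_point(size=75,alpha=0.8) \
        + ggtitle("First and Second Principal Components colored by coin function")
chart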

## Perform t-SNE on _feature_df_ using 2 components and 3 components and plot each

In [None]:
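import time
from sklearn.manifold import TSNE

# t-SNE needs a purely numeric matrix: use the feature and market cap columns,
# coerce '-' placeholders to NaN and drop those rows ('label' is only used for coloring)
tsne_input = feature_df[[feature, 'usd_market_cap']].apply(pd.to_numeric, errors='coerce').dropna()
rndperm = np.random.permutation(tsne_input.index)  # shuffled row labels, reused by the plotting cells
n_sne = len(rndperm)

time_start = time.time()
# Recent scikit-learn releases rename n_iter to max_iter
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(tsne_input.loc[rndperm[:n_sne]].values)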

print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

Visualise the two dimensions by creating a scatter plot and coloring each sample by its respective label.

In [None]:
# Take the embedded rows from feature_df (df only holds the last CSV read in the loading loop)
df_tsne = feature_df.loc[rndperm[:n_sne],:].copy()
df_tsne['x-tsne'] = tsne_results[:,0]
df_tsne['y-tsne'] = tsne_results[:,1]

chart = ggplot( df_tsne, aes(x='x-tsne', y='y-tsne', color='label') ) \
        + geom_point(size=70,alpha=0.1) \
        + ggtitle("tSNE dimensions colored by coin function")
chart

3 Components

In [None]:
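time_start = time.time()
# init='random' avoids a PCA-based initialization, which cannot produce 3 components from 2 input columns
tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300, init='random')
# Fit on the same shuffled rows as the 2-component run, using only the numeric columns
tsne_results = tsne.fit_transform(
    feature_df.loc[rndperm[:n_sne], [feature, 'usd_market_cap']].apply(pd.to_numeric, errors='coerce').values)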

print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

In [None]:
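# Take the embedded rows from feature_df (df only holds the last CSV read in the loading loop)
df_tsne = feature_df.loc[rndperm[:n_sne],:].copy()
df_tsne['x-tsne'] = tsne_results[:,0]
df_tsne['y-tsne'] = tsne_results[:,1]
df_tsne['z-tsne'] = tsne_results[:,2]

# This chart is 2D, so it shows only the first two of the three t-SNE dimensions
chart = ggplot( df_tsne, aes(x='x-tsne', y='y-tsne', color='label') ) \
        + geom_point(size=70,alpha=0.1) \
        + ggtitle("tSNE dimensions colored by coin function")
chart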
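
The chart above is two-dimensional, so the third t-SNE dimension is not visible. As a hedged alternative, the sketch below draws all three dimensions with matplotlib's 3D axes instead of ggplot. It runs on hypothetical stand-in data (`X`, `labels`) rather than on `feature_df`, and the perplexity and seed are illustrative choices.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 60 samples with 5 numeric features and two labels
X = np.vstack([rng.normal(0, 1, size=(30, 5)), rng.normal(4, 1, size=(30, 5))])
labels = np.array(['payment'] * 30 + ['utility'] * 30)

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=3, perplexity=10, init='random', random_state=0)
embedding = tsne.fit_transform(X)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
for name in np.unique(labels):
    mask = labels == name
    # One scatter call per label so each label gets its own color and legend entry
    ax.scatter(embedding[mask, 0], embedding[mask, 1], embedding[mask, 2],
               label=name, alpha=0.6)
ax.set_xlabel('x-tsne')
ax.set_ylabel('y-tsne')
ax.set_zlabel('z-tsne')
ax.legend()
```

To apply this to the notebook's data, `embedding` would be replaced by `tsne_results` and `labels` by the `label` column of the embedded rows.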