# Clustering Mapper

## Étapes

* Lisser par rapport au temps (B)
* Passer au log
* Enlever les index
* Normaliser
* ACP (JB)
* km.cover(n = 20, cov = 0.5) (G)
* km.map(ACP, data, cover)
* Clustering (JB/M)
* Créer le graph (M)

## Importation des modules

### Import des modules de bases

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

### Pour normaliser les données

Separating out the features

    x = df.loc[:, features].values

Standardizing the features

    x = StandardScaler().fit_transform(x)

In [2]:
from sklearn.preprocessing import StandardScaler

### Pour faire l'ACP

Initialise la classe

    pca = PCA(n_components=2)

Fit le modèle

    principalComponents = pca.fit_transform(x)

Transforme en df pandas

    principalDf = pd.DataFrame(data = principalComponents
                , columns = ['principal component 1', 'principal component 2'])
    finalDf = pd.concat([df[index]], principalDf, axis = 1)

In [3]:
from sklearn.decomposition import PCA

### Pour faire le clustering

En utilisant sklearn :

    model = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='single')
    model.fit(X)
    labels = model.labels_

In [4]:
from sklearn.cluster import AgglomerativeClustering

En utilisant scipy :

    link = sch.linkage(y, method='single", metric='...')
    dendrogram = sch.dendrogram(link)

Voir https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

In [5]:
import scipy.cluster.hierarchy as sch 

### Keppler Mapper

In [6]:
import kmapper as km
from kmapper.cover import Cover
from kmapper import jupyter # Creates custom CSS full-size Jupyter screen

## Chargement des données

In [7]:
# data_firm_level = pd.read_stata("../Data/Firm_patent/data_firm_level.dta")
# data_patent_level = pd.read_stata("../Data/Patent_level_data/data_patent_level.dta")
# cites = pd.read_stata("../Data/Patent_level_data/USPatent_1926-2010/cites/cites.dta")
# firm_innovation_v2 = pd.read_stata("../Data/Patent_level_data/USPatent_1926-2010/firm_innovation/firm_innovation_v2.dta")
# patents_xi = pd.read_stata("../Data/Patent_level_data/USPatent_1926-2010/patents_xi/patents_xi.dta")
# patent_values = pd.read_stata("../Data/Patent_level_data/Patent_CRSP_match_1929-2017/patent_values/patent_values.dta")

## Utilisation de la base merged

### Récupération des données en dataframe pandas

In [8]:
patents_firm_merge = pd.read_stata("../Data/Firm_patent/patents_firm_merge.dta")

In [9]:
patents_firm_merge

Unnamed: 0,index,patnum,fdate,idate,pdate,permno,patent_class,subclass,ncites,xi,year,Npats,Tcw,Tsm,tcw,tsm,_merge
0,37352,1605417,10/23/1923,11/02/1926,,10006.0,403.0,206000O,2.0,0.046886,1926,10,18.980768,0.375693,,,matched (3)
1,11188,1579234,05/14/1923,04/06/1926,,10006.0,74.0,503000O,4.0,0.031358,1926,10,18.980768,0.375693,,,matched (3)
2,37345,1605410,04/09/1926,11/02/1926,,10006.0,295.0,042000O,1.0,0.046886,1926,10,18.980768,0.375693,,,matched (3)
3,37377,1605442,03/24/1922,11/02/1926,,10006.0,164.0,168000O,0.0,0.046886,1926,10,18.980768,0.375693,,,matched (3)
4,37350,1605415,01/22/1923,11/02/1926,,10006.0,267.0,086000O,2.0,0.046886,1926,10,18.980768,0.375693,,,matched (3)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1844876,4252965,5834226,01/31/1991,11/10/1998,,93236.0,435.0,015000O,23.0,0.107588,1998,1,2.480992,0.107588,,,matched (3)
1844877,2942383,4515145,10/03/1983,05/07/1985,,93252.0,126.0,09900AO,24.0,0.141288,1985,1,2.896310,0.141288,0.734918,0.035851,matched (3)
1844878,3287665,4860725,04/27/1989,08/29/1989,,93252.0,126.0,11000RO,6.0,0.092077,1989,1,1.415194,0.092077,0.449839,0.029268,matched (3)
1844879,3000202,4573009,12/07/1983,02/25/1986,,93287.0,324.0,750250O,12.0,0.457411,1986,2,4.185905,0.830930,0.081887,0.016255,matched (3)


### Utilise les index données dans la df et convertit les dates

In [10]:
datetime_df = patents_firm_merge.set_index("index")
for col in ["fdate", "idate", "pdate"]:
    datetime_df[col] = pd.to_datetime(patents_firm_merge[col], infer_datetime_format=True, errors="coerce")

In [11]:
datetime_df.dtypes

patnum                   int32
fdate           datetime64[ns]
idate           datetime64[ns]
pdate           datetime64[ns]
permno                 float64
patent_class            object
subclass                object
ncites                 float64
xi                     float64
year                     int32
Npats                    int32
Tcw                    float64
Tsm                    float64
tcw                    float64
tsm                    float64
_merge                category
dtype: object

### On enlève les lignes incomplètes

On voit le pourcentage de lignes non vides pour chaques colonnes :

In [12]:
datetime_df.count()/len(datetime_df)

patnum          1.000000
fdate           0.179741
idate           0.181243
pdate           0.050828
permno          1.000000
patent_class    1.000000
subclass        1.000000
ncites          0.976691
xi              0.976691
year            1.000000
Npats           1.000000
Tcw             1.000000
Tsm             1.000000
tcw             0.904892
tsm             0.904892
_merge          1.000000
dtype: float64

In [13]:
full_df = datetime_df.dropna(subset=['xi', 'ncites', 'tcw', 'tsm'])

In [14]:
full_df.count()/len(full_df)

patnum          1.000000
fdate           0.115469
idate           0.116026
pdate           0.042676
permno          1.000000
patent_class    1.000000
subclass        1.000000
ncites          1.000000
xi              1.000000
year            1.000000
Npats           1.000000
Tcw             1.000000
Tsm             1.000000
tcw             1.000000
tsm             1.000000
_merge          1.000000
dtype: float64

In [15]:
full_df

Unnamed: 0_level_0,patnum,fdate,idate,pdate,permno,patent_class,subclass,ncites,xi,year,Npats,Tcw,Tsm,tcw,tsm,_merge
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
945816,2514534,1967-10-12,1969-12-30,NaT,10006.0,295.0,042000O,0.0,0.120056,1950,6,8.378170,0.782956,0.057741,0.005396,matched (3)
935937,2504645,1985-05-08,1988-02-16,NaT,10006.0,297.0,067000O,3.0,0.102665,1950,6,8.378170,0.782956,0.057741,0.005396,matched (3)
959346,2528074,NaT,1972-09-19,NaT,10006.0,105.0,457000O,1.0,0.142487,1950,6,8.378170,0.782956,0.057741,0.005396,matched (3)
927976,2496677,1998-01-16,1999-06-29,NaT,10006.0,285.0,189000O,6.0,0.135619,1950,6,8.378170,0.782956,0.057741,0.005396,matched (3)
939627,2508339,1995-06-16,1997-06-17,NaT,10006.0,105.0,004100O,2.0,0.132544,1950,6,8.378170,0.782956,0.057741,0.005396,matched (3)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4120707,5699182,NaT,NaT,NaT,93236.0,359.0,321000O,3.0,0.279402,1997,1,1.191137,0.279402,0.331701,0.077806,matched (3)
2942383,4515145,NaT,NaT,NaT,93252.0,126.0,09900AO,24.0,0.141288,1985,1,2.896310,0.141288,0.734918,0.035851,matched (3)
3287665,4860725,NaT,NaT,NaT,93252.0,126.0,11000RO,6.0,0.092077,1989,1,1.415194,0.092077,0.449839,0.029268,matched (3)
3000202,4573009,NaT,NaT,NaT,93287.0,324.0,750250O,12.0,0.457411,1986,2,4.185905,0.830930,0.081887,0.016255,matched (3)


### On lisse les données numériques par rapport au temps

In [16]:
features = ["xi", "Tcw", "Tsm", "tcw", "tsm", "ncites"]
SMA_features = ["SMA_"+l for l in features]

In [17]:
full_df[SMA_features] = full_df.sort_values(by="idate"
).groupby(["permno", "patent_class"]
)[features
].rolling(window=5, min_periods=1
).mean(
).reset_index(level=[0, 1], drop=True
).rename(columns={l: "SMA_"+l for l in features})

In [18]:
for l in features:
    full_df["log_"+l] = np.log(1 + full_df["SMA_"+l])

### On normalise les données numériques lissées et passées au log

In [19]:
matrix = full_df[['log_xi', 'log_Tcw', 'log_Tsm', 'log_tcw', 'log_tsm', 'log_ncites']]

In [20]:
matrix

Unnamed: 0_level_0,log_xi,log_Tcw,log_Tsm,log_tcw,log_tsm,log_ncites
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
945816,0.113379,2.238385,0.578272,0.056135,0.005381,0.000000
935937,0.051180,3.433527,0.856201,0.178222,0.008833,1.832581
959346,0.088516,3.448855,0.868194,0.179613,0.008955,0.559616
927976,0.127178,2.238385,0.578272,0.056135,0.005381,1.945910
939627,0.617228,4.432503,3.219052,0.253160,0.070605,2.397895
...,...,...,...,...,...,...
4120707,0.246393,0.784421,0.246393,0.286457,0.074928,1.386294
2942383,0.132157,1.360030,0.132157,0.550960,0.035223,3.218876
3287665,0.110362,1.149227,0.110362,0.465229,0.032041,2.772589
3000202,0.376662,1.645944,0.604824,0.078707,0.016124,2.564949


In [21]:
normalised_matrix = StandardScaler().fit_transform(matrix)

In [22]:
normalised_matrix

array([[-1.23362196, -2.18521409, -2.3238568 , -0.38447203, -0.81615607,
        -2.12685935],
       [-1.27959364, -1.49767179, -2.21475615,  0.26466516, -0.80346417,
        -0.20609846],
       [-1.25199847, -1.48885425, -2.2100485 ,  0.27206083, -0.80301852,
        -1.54031619],
       ...,
       [-1.23585146, -2.81178535, -2.50753463,  1.79068831, -0.71813784,
         0.77913969],
       [-1.03902582, -2.52603353, -2.31343393, -0.26445808, -0.77665703,
         0.56150922],
       [-1.08284464, -2.52603353, -2.31343393, -0.26445808, -0.77665703,
         0.90259016]])

### On fait une ACP sur cette matrice

Puis on rajoute les indices

In [23]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(normalised_matrix)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])

In [24]:
projected_data = pd.concat([matrix.reset_index()["index"], principalDf], axis=1).set_index("index")

In [25]:
projected_data

Unnamed: 0_level_0,PC1,PC2
index,Unnamed: 1_level_1,Unnamed: 2_level_1
945816,-3.158073,-0.616306
935937,-2.922589,0.442155
959346,-2.979120,-0.339355
927976,-3.037132,0.587018
939627,-1.908329,0.721535
...,...,...
4120707,-3.517691,1.298113
2942383,-3.743391,2.790867
3287665,-3.729543,2.373993
3000202,-3.012566,1.233964


### On applique le Mapper Algorithm

In [26]:
# Initialize
mapper = km.KeplerMapper(verbose=0)
# Cover
cov = Cover(n_cubes=20, perc_overlap=0.5)

In [27]:
proj_matrix = mapper.fit_transform(X=matrix, projection=PCA(n_components=2), scaler=StandardScaler())

In [28]:
# Create dictionary called 'graph' with nodes, edges and meta-information
graph = mapper.map(lens=projected_data, X=normalised_matrix, cover=cov, clusterer=AgglomerativeClustering(10, linkage="single"))

In [None]:
# Visualize it
html = mapper.visualize(graph, path_html="../docs/MapperCluster.html", title="Mapper Clustering Algorithm")

# Inline display
# jupyter.display(path_html="../docs/MapperCluster.html")