# Clustering Mapper

## Steps

* Smooth over time (B)
* Take the log
* Drop the indices
* Normalize
* PCA (JB)
* `Cover(n_cubes=20, perc_overlap=0.5)` (G)
* `mapper.map(lens, X, cover)`
* Clustering (JB/M)
* Build the graph (M)

## Module imports

### Base module imports

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

### Normalizing the data

Separating out the features

    x = df.loc[:, features].values

Standardizing the features

    x = StandardScaler().fit_transform(x)

In [1]:
from sklearn.preprocessing import StandardScaler
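A minimal sketch of what the standardization step does, on a small made-up feature matrix (the values are hypothetical): after `fit_transform`, each column should have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 samples, 2 features on very different scales
x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Center each column and divide it by its (population) standard deviation
x_scaled = StandardScaler().fit_transform(x)

col_means = x_scaled.mean(axis=0)
col_stds = x_scaled.std(axis=0)
print(col_means, col_stds)
```

Both columns end up on the same scale, which keeps any single feature from dominating the PCA step.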

### PCA

Initialize the class

    pca = PCA(n_components=2)

Fit the model

    principalComponents = pca.fit_transform(x)

Convert to a pandas DataFrame

    principalDf = pd.DataFrame(data=principalComponents,
                               columns=['principal component 1', 'principal component 2'])
    finalDf = pd.concat([df[index], principalDf], axis=1)

In [1]:
from sklearn.decomposition import PCA
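To check how much information two components retain, `explained_variance_ratio_` can be inspected after fitting. The sketch below uses synthetic data (an assumption, not the patent data) in which the third feature is almost a copy of the first, so two components should capture nearly all the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: f3 is f1 plus a little noise, so the data is essentially 2-dimensional
f1 = rng.normal(size=200)
f2 = rng.normal(size=200)
f3 = f1 + 0.01 * rng.normal(size=200)
x = np.column_stack([f1, f2, f3])

pca = PCA(n_components=2)
components = pca.fit_transform(x)

# Fraction of the total variance kept by the two components
explained = pca.explained_variance_ratio_.sum()
print(components.shape, round(explained, 4))
```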

### Clustering

Using sklearn:

    # `metric` replaces the `affinity` argument used by scikit-learn versions before 1.2
    model = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='single')
    model.fit(X)
    labels = model.labels_

In [1]:
from sklearn.cluster import AgglomerativeClustering
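As a quick sanity check, here is a sketch of single-linkage agglomerative clustering on two made-up, well-separated groups of points (hypothetical data). The distance argument is left at its default (Euclidean) so the snippet does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: three points near the origin, three near (5, 5)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Single linkage merges the two clusters whose closest members are nearest
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels)
```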

Using scipy:

    link = sch.linkage(y, method='single', metric='...')
    dendrogram = sch.dendrogram(link)

See https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

In [1]:
import scipy.cluster.hierarchy as sch 
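The scipy route can also produce flat cluster labels, not only a dendrogram. A small sketch on made-up points (hypothetical data), cutting the single-linkage tree into two clusters with `fcluster`:

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical toy data: two well-separated groups of 2-D points
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# The linkage matrix has one row per merge: n - 1 rows, 4 columns
link = sch.linkage(points, method="single", metric="euclidean")

# Cut the tree so that at most two flat clusters remain
labels = sch.fcluster(link, t=2, criterion="maxclust")
print(link.shape, labels)
```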

### Kepler Mapper

In [3]:
import kmapper as km
from kmapper.cover import Cover
from kmapper import jupyter # Creates custom CSS full-size Jupyter screen

## Loading the data

In [8]:
# data_firm_level = pd.read_stata("../data/Firm_patent/data_firm_level.dta")
# data_patent_level = pd.read_stata("../data/Patent_level_data/data_patent_level.dta")
# cites = pd.read_stata("../data/Patent_level_data/USPatent_1926-2010/cites/cites.dta")
# firm_innovation_v2 = pd.read_stata("../data/Patent_level_data/USPatent_1926-2010/firm_innovation/firm_innovation_v2.dta")
# patents_xi = pd.read_stata("../data/Patent_level_data/USPatent_1926-2010/patents_xi/patents_xi.dta")
# patent_values = pd.read_stata("../data/Patent_level_data/Patent_CRSP_match_1929-2017/patent_values/patent_values.dta")
# patents_firm_merge = pd.read_stata("../data/Firm_patent/patents_firm_merge.dta")

## Using the merged dataset

### Loading the data into a pandas DataFrame

In [4]:
patents_firm_merge = pd.read_stata("../data/Firm_patent/patents_firm_merge.dta")

In [9]:
patents_firm_merge.head()

Unnamed: 0,index,patnum,fdate,idate,pdate,permno,patent_class,subclass,ncites,xi,year,Npats,Tcw,Tsm,tcw,tsm,_merge
0,37352,1605417,10/23/1923,11/02/1926,,10006.0,403.0,206000O,2.0,0.046886,1926,10,18.980768,0.375693,,,matched (3)
1,11188,1579234,05/14/1923,04/06/1926,,10006.0,74.0,503000O,4.0,0.031358,1926,10,18.980768,0.375693,,,matched (3)
2,37345,1605410,04/09/1926,11/02/1926,,10006.0,295.0,042000O,1.0,0.046886,1926,10,18.980768,0.375693,,,matched (3)
3,37377,1605442,03/24/1922,11/02/1926,,10006.0,164.0,168000O,0.0,0.046886,1926,10,18.980768,0.375693,,,matched (3)
4,37350,1605415,01/22/1923,11/02/1926,,10006.0,267.0,086000O,2.0,0.046886,1926,10,18.980768,0.375693,,,matched (3)


In [11]:
patents_firm_merge.count()/len(patents_firm_merge)

index           1.000000
patnum          1.000000
fdate           1.000000
idate           1.000000
pdate           1.000000
permno          1.000000
patent_class    1.000000
subclass        1.000000
ncites          0.976691
xi              0.976691
year            1.000000
Npats           1.000000
Tcw             1.000000
Tsm             1.000000
tcw             0.904892
tsm             0.904892
_merge          1.000000
dtype: float64

### Keep a few large firms (ranks 11 to 13 by mean `Npats`)

In [12]:
patents_firm_merge[patents_firm_merge["permno"]==12490.0]

Unnamed: 0,index,patnum,fdate,idate,pdate,permno,patent_class,subclass,ncites,xi,year,Npats,Tcw,Tsm,tcw,tsm,_merge
328307,38913,1606979,04/03/1926,11/16/1926,,12490.0,206.0,375000O,1.0,0.151648,1926,1,1.561298,0.151648,,,matched (3)
328308,78759,1646850,06/04/1925,10/25/1927,,12490.0,279.0,051000O,1.0,0.268815,1927,2,3.655511,0.500838,,,matched (3)
328309,58785,1626871,07/09/1924,05/03/1927,,12490.0,234.0,030000O,2.0,0.232024,1927,2,3.655511,0.500838,,,matched (3)
328310,104932,1673039,06/30/1926,06/12/1928,,12490.0,407.0,064000O,5.0,0.379686,1928,3,6.232343,1.178278,,,matched (3)
328311,106553,1674660,04/17/1922,06/26/1928,,12490.0,271.0,258050O,0.0,0.393123,1928,3,6.232343,1.178278,,,matched (3)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
400802,6057276,7644291,12/30/2005,01/05/2010,07/05/2007,12490.0,713.0,300000O,0.0,7.441997,2010,5052,9143.233767,20864.188617,0.080591,0.183903,matched (3)
400803,6081856,7668905,12/01/2005,02/23/2010,03/08/2007,12490.0,709.0,203000O,0.0,3.712214,2010,5052,9143.233767,20864.188617,0.080591,0.183903,matched (3)
400804,6077671,7664711,12/16/2002,02/16/2010,06/17/2004,12490.0,705.0,412000O,1.0,18.711040,2010,5052,9143.233767,20864.188617,0.080591,0.183903,matched (3)
400805,6102662,7689774,04/06/2007,03/30/2010,10/09/2008,12490.0,711.0,137000O,0.0,3.650043,2010,5052,9143.233767,20864.188617,0.080591,0.183903,matched (3)


In [13]:
patents_firm_merge["permno"].nunique()

6995

In [14]:
big_firms = patents_firm_merge.groupby("permno"
)["Npats"
].mean(
).sort_values(ascending=False
).iloc[10:13]

In [15]:
big_firms_index = big_firms.reset_index()["permno"].values

In [16]:
reduced_data = patents_firm_merge[patents_firm_merge["permno"].isin(big_firms_index)]
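The selection logic (rank firms by mean `Npats`, slice the ranking, filter the rows) can be illustrated on a tiny made-up table. The data and the `iloc[1:3]` slice are hypothetical; the notebook itself uses `iloc[10:13]`.

```python
import pandas as pd

# Hypothetical toy data: one row per patent, Npats repeated on each row
toy = pd.DataFrame({
    "permno": [1, 1, 2, 2, 3, 3, 4, 4],
    "Npats":  [10, 30, 5, 5, 100, 200, 1, 3],
})

# Mean patent count per firm, largest first
ranking = toy.groupby("permno")["Npats"].mean().sort_values(ascending=False)

# iloc[1:3] keeps ranks 2 and 3 and skips the top firm,
# just as iloc[10:13] keeps ranks 11 to 13
selected = ranking.iloc[1:3].index.values
reduced = toy[toy["permno"].isin(selected)]
print(list(selected), len(reduced))
```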

In [17]:
reduced_data

Unnamed: 0,index,patnum,fdate,idate,pdate,permno,patent_class,subclass,ncites,xi,year,Npats,Tcw,Tsm,tcw,tsm,_merge
1104029,1440696,3010070,05/31/1960,11/21/1961,,27828.0,327.0,134000O,3.0,1.580217,1961,11,35.180475,30.088601,,,matched (3)
1104030,1424641,2994012,03/17/1960,07/25/1961,,27828.0,315.0,206000O,2.0,3.676502,1961,11,35.180475,30.088601,,,matched (3)
1104031,1441755,3011129,08/10/1959,11/28/1961,,27828.0,315.0,076000O,33.0,3.569988,1961,11,35.180475,30.088601,,,matched (3)
1104032,1431758,3001131,06/30/1958,09/19/1961,,27828.0,324.0,074000O,8.0,2.846502,1961,11,35.180475,30.088601,,,matched (3)
1104033,1432778,3002151,06/18/1957,09/26/1961,,27828.0,327.0,176000O,7.0,2.763990,1961,11,35.180475,30.088601,,,matched (3)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1815296,6145440,7732709,07/16/2007,06/08/2010,01/24/2008,88935.0,174.0,050000O,2.0,1.675647,2010,1060,1794.248698,1190.591984,0.012829,0.008513,matched (3)
1815297,6222712,7810186,09/20/2005,10/12/2010,05/01/2008,88935.0,5.0,601000O,0.0,1.581025,2010,1060,1794.248698,1190.591984,0.012829,0.008513,matched (3)
1815298,6116429,7703589,03/22/2006,04/27/2010,05/14/2009,88935.0,191.0,032000O,0.0,0.862609,2010,1060,1794.248698,1190.591984,0.012829,0.008513,matched (3)
1815299,6213582,7801038,10/23/2003,09/21/2010,01/20/2005,88935.0,370.0,230100O,0.0,1.239622,2010,1060,1794.248698,1190.591984,0.012829,0.008513,matched (3)


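### Use the index column provided in the DataFrame and convert the dates

In [18]:
# Work on a copy so the assignments below do not modify (or warn about) reduced_data
datetime_df = reduced_data.copy()
for col in ["fdate", "idate", "pdate"]:
    # Empty or unparseable dates become NaT instead of raising
    datetime_df[col] = pd.to_datetime(datetime_df[col], format="%m/%d/%Y", errors="coerce")
datetime_df = datetime_df.set_index("index")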

In [19]:
datetime_df.dtypes

patnum                   int32
fdate           datetime64[ns]
idate           datetime64[ns]
pdate           datetime64[ns]
permno                 float64
patent_class            object
subclass                object
ncites                 float64
xi                     float64
year                     int32
Npats                    int32
Tcw                    float64
Tsm                    float64
tcw                    float64
tsm                    float64
_merge                category
dtype: object
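A small sketch of the date conversion on made-up rows (hypothetical values): with `errors="coerce"`, an empty string is turned into `NaT`, so a column that looked complete as text can show missing values once converted.

```python
import pandas as pd

# Hypothetical rows: the last filing date is an empty string
raw = pd.DataFrame({"index": [37352, 11188, 37345],
                    "fdate": ["10/23/1923", "05/14/1923", ""]})

# .copy() gives an independent frame, so `raw` is left untouched
converted = raw.copy()
converted["fdate"] = pd.to_datetime(converted["fdate"], format="%m/%d/%Y", errors="coerce")
converted = converted.set_index("index")

n_missing = converted["fdate"].isna().sum()
print(converted, n_missing)
```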

### Drop incomplete rows

Fraction of non-missing values in each column:

In [20]:
datetime_df.count()/len(datetime_df)

patnum          1.000000
fdate           0.998908
idate           1.000000
pdate           0.568298
permno          1.000000
patent_class    1.000000
subclass        1.000000
ncites          0.996188
xi              0.996188
year            1.000000
Npats           1.000000
Tcw             1.000000
Tsm             1.000000
tcw             0.999789
tsm             0.999789
_merge          1.000000
dtype: float64

In [21]:
# .copy() so that adding columns to full_df later does not trigger SettingWithCopyWarning
full_df = datetime_df.dropna(subset=['xi', 'ncites', 'tcw', 'tsm']).copy()

In [22]:
full_df.count()/len(full_df)

patnum          1.000000
fdate           0.999423
idate           1.000000
pdate           0.568343
permno          1.000000
patent_class    1.000000
subclass        1.000000
ncites          1.000000
xi              1.000000
year            1.000000
Npats           1.000000
Tcw             1.000000
Tsm             1.000000
tcw             1.000000
tsm             1.000000
_merge          1.000000
dtype: float64
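The effect of `dropna(subset=...)` can be seen on a tiny made-up frame (hypothetical values): only rows missing a value in the listed columns are dropped, so a column outside the subset, such as `pdate`, can stay incomplete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second row lacks xi, two rows lack pdate
toy = pd.DataFrame({
    "xi":     [0.2, np.nan, 1.5],
    "ncites": [6.0, 2.0, 0.0],
    "pdate":  pd.to_datetime(["2008-01-24", None, None]),
})

# Only xi and ncites are required to be present
kept = toy.dropna(subset=["xi", "ncites"])

# Fraction of non-missing values per column, as in the cells above
completeness = kept.count() / len(kept)
print(completeness)
```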

In [23]:
full_df

index,patnum,fdate,idate,pdate,permno,patent_class,subclass,ncites,xi,year,Npats,Tcw,Tsm,tcw,tsm,_merge
1470886,3040265,1960-07-18,1962-06-19,NaT,27828.0,330,290000O,6.0,0.239454,1962,7,20.538391,3.101149,0.339478,0.051259,matched (3)
1472156,3041535,1959-01-12,1962-06-26,NaT,27828.0,324,118000O,22.0,1.664426,1962,7,20.538391,3.101149,0.339478,0.051259,matched (3)
1470827,3040206,1959-11-04,1962-06-19,NaT,27828.0,315,383000O,2.0,0.239454,1962,7,20.538391,3.101149,0.339478,0.051259,matched (3)
1470807,3040186,1960-09-19,1962-06-19,NaT,27828.0,327,114000O,13.0,0.239454,1962,7,20.538391,3.101149,0.339478,0.051259,matched (3)
1470779,3040158,1960-12-01,1962-06-19,NaT,27828.0,219,210000O,12.0,0.239454,1962,7,20.538391,3.101149,0.339478,0.051259,matched (3)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6145440,7732709,2007-07-16,2010-06-08,2008-01-24,88935.0,174.0,050000O,2.0,1.675647,2010,1060,1794.248698,1190.591984,0.012829,0.008513,matched (3)
6222712,7810186,2005-09-20,2010-10-12,2008-05-01,88935.0,5.0,601000O,0.0,1.581025,2010,1060,1794.248698,1190.591984,0.012829,0.008513,matched (3)
6116429,7703589,2006-03-22,2010-04-27,2009-05-14,88935.0,191.0,032000O,0.0,0.862609,2010,1060,1794.248698,1190.591984,0.012829,0.008513,matched (3)
6213582,7801038,2003-10-23,2010-09-21,2005-01-20,88935.0,370.0,230100O,0.0,1.239622,2010,1060,1794.248698,1190.591984,0.012829,0.008513,matched (3)


### Smooth the numeric features over time

In [24]:
features = ["xi", "Tcw", "Tsm", "tcw", "tsm", "ncites"]
SMA_features = ["SMA_"+l for l in features]

In [25]:
full_df[SMA_features] = full_df.sort_values(by="idate"
).groupby(["permno", "patent_class"]
)[features
].rolling(window=5, min_periods=1 # =5?
).mean(
).reset_index(level=[0, 1], drop=True
).rename(columns={l: "SMA_"+l for l in features})
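A sketch of the same grouped rolling mean on a small made-up frame (hypothetical data, a single grouping key and `window=2` for brevity). Dropping the group level from the result leaves the original row index, so the assignment aligns each smoothed value with its own row even though the rows were sorted first.

```python
import pandas as pd

# Hypothetical toy data: two firms, rows not in date order within the frame
toy = pd.DataFrame({
    "permno": [1, 1, 1, 2, 2],
    "idate": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01",
                             "2000-01-15", "2000-02-15"]),
    "xi": [1.0, 3.0, 5.0, 10.0, 20.0],
}, index=pd.Index([101, 102, 103, 201, 202], name="index"))

sma = (toy.sort_values(by="idate")
          .groupby("permno")["xi"]
          .rolling(window=2, min_periods=1)   # first row of each group averages itself only
          .mean()
          .reset_index(level=0, drop=True))   # drop the permno level, keep the row index

# Assignment aligns on the row index, not on position
toy["SMA_xi"] = sma
print(toy)
```

With `min_periods=1`, the first observations of each group are averaged over fewer than `window` rows instead of becoming `NaN`.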

In [26]:
for l in features:
    full_df["log_"+l] = np.log(1 + full_df["SMA_"+l])
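The transform $\log(1 + x)$ maps 0 to 0, which suits count-like features such as `ncites`. As a side note, `np.log1p` computes the same quantity and is more accurate for very small inputs. The sketch below compares the two on made-up values (hypothetical).

```python
import numpy as np

# Hypothetical values, including zero and a very small number
values = np.array([0.0, 1e-12, 1.0, 33.0])

log_plain = np.log(1 + values)   # form used in the cell above
log1p_values = np.log1p(values)  # same function, better precision near zero

print(log_plain, log1p_values)
```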

### Normalize the smoothed, log-transformed numeric features

In [27]:
matrix = full_df[['log_xi', 'log_Tcw', 'log_Tsm', 'log_tcw', 'log_tsm', 'log_ncites']]

In [28]:
matrix

index,log_xi,log_Tcw,log_Tsm,log_tcw,log_tsm,log_ncites
1470886,0.214671,3.069837,1.411267,0.292280,0.049988,1.945910
1472156,0.668824,3.069837,1.411267,0.292280,0.049988,2.995732
1470827,0.214671,3.069837,1.411267,0.292280,0.049988,1.098612
1470807,0.214671,3.069837,1.411267,0.292280,0.049988,2.639057
1470779,0.214671,3.069837,1.411267,0.292280,0.049988,2.564949
...,...,...,...,...,...,...
6145440,0.797136,7.489044,7.187985,0.012752,0.009454,0.470004
6222712,0.790934,7.492899,7.083045,0.012748,0.008477,0.000000
6116429,0.273720,7.496506,6.072049,0.018200,0.003767,0.182322
6213582,0.780494,7.492899,7.083045,0.012748,0.008477,0.000000


In [29]:
normalised_matrix = StandardScaler().fit_transform(matrix)

In [30]:
normalised_matrix

array([[-1.02521922, -5.658049  , -3.01405747,  6.38254427, -0.52548486,
         0.30047711],
       [-0.68574681, -5.658049  , -3.01405747,  6.38254427, -0.52548486,
         1.3422254 ],
       [-1.02521922, -5.658049  , -3.01405747,  6.38254427, -0.52548486,
        -0.54030447],
       ...,
       [-0.98108127,  0.04799788, -0.7894394 , -0.44953741, -0.72288655,
        -1.44954831],
       [-0.60227526,  0.04334845, -0.30688465, -0.58544459, -0.70277041,
        -1.6304677 ],
       [-0.47020839,  0.03837899, -0.2567962 , -0.58533426, -0.69859983,
        -1.16407871]])

### Run a PCA on this matrix

Then reattach the index

In [31]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(normalised_matrix)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])

In [32]:
projected_data = pd.concat([matrix.reset_index()["index"], principalDf], axis=1).set_index("index")
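The concat-then-`set_index` route works because both pieces share a positional `RangeIndex` at the moment they are joined. A shorter alternative, sketched here on made-up stand-ins (hypothetical values), is to pass the original index directly to the `DataFrame` constructor.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the feature matrix and the PCA output
matrix = pd.DataFrame({"log_xi": [0.2, 0.7, 0.3]},
                      index=pd.Index([1470886, 1472156, 6145440], name="index"))
principalComponents = np.array([[-0.9, 7.8], [-0.4, 8.3], [-1.3, -0.7]])

# Route used in the cell above: join on positions, then restore the index
principalDf = pd.DataFrame(data=principalComponents, columns=["PC1", "PC2"])
via_concat = pd.concat([matrix.reset_index()["index"], principalDf], axis=1).set_index("index")

# Shorter alternative: attach the index when building the frame
direct = pd.DataFrame(principalComponents, columns=["PC1", "PC2"], index=matrix.index)
print(direct)
```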

In [33]:
projected_data

index,PC1,PC2
1470886,-0.907068,7.830799
1472156,-0.443467,8.275950
1470827,-1.130389,7.445494
1470807,-0.724377,8.146004
1470779,-0.743909,8.112304
...,...,...
6145440,-1.302928,-0.714676
6222712,-1.456056,-0.913321
6116429,-1.832127,-0.563809
6213582,-1.460353,-0.912580


### Apply the Mapper algorithm

In [34]:
# Initialize
mapper = km.KeplerMapper(verbose=1)

KeplerMapper(verbose=1)


In [35]:
proj_matrix = mapper.fit_transform(X=normalised_matrix, projection=PCA(n_components=2)) # , scaler=StandardScaler())

..Composing projection pipeline of length 1:
	Projections: PCA(n_components=2)
	Distance matrices: False
	Scalers: MinMaxScaler()
..Projecting on data shaped (51995, 6)

..Projecting data using: 
	PCA(n_components=2)


..Scaling with: MinMaxScaler()



In [36]:
proj_matrix

array([[0.1984046 , 0.86944952],
       [0.25872037, 0.90922807],
       [0.16934996, 0.8350188 ],
       ...,
       [0.07805192, 0.11930942],
       [0.1264207 , 0.0881433 ],
       [0.15547107, 0.10481237]])
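The log above lists `MinMaxScaler()` as the scaler, which is why the lens values lie in [0, 1] rather than on the raw PCA scale. A rough equivalent using scikit-learn alone, on random stand-in data (an assumption, not the patent matrix), could look like this:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # hypothetical stand-in for normalised_matrix

# Project to two components, then rescale each component to [0, 1]
projected = PCA(n_components=2).fit_transform(X)
lens = MinMaxScaler().fit_transform(projected)

print(lens.min(axis=0), lens.max(axis=0))
```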

In [39]:
# Create dictionary called 'graph' with nodes, edges and meta-information
graph = mapper.map(lens=proj_matrix, X=normalised_matrix, cover=Cover(n_cubes=20, perc_overlap=0.5)) #, clusterer=AgglomerativeClustering(n_clusters=[2], linkage="single"))

Mapping on data shaped (51995, 6) using lens shaped (51995, 2)

Creating 25 hypercubes.

Created 206 edges and 108 nodes in 0:01:00.624035.
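To make the mapping step less of a black box, here is a deliberately simplified Mapper sketch with a 1-D lens. It is not kmapper's implementation: the helper names `interval_cover` and `toy_mapper`, the interval-widening formula, and the DBSCAN parameters are all assumptions chosen for illustration. The idea is to cover the lens with overlapping intervals, cluster the points falling in each interval, and link clusters that share points.

```python
import itertools
import numpy as np
from sklearn.cluster import DBSCAN

def interval_cover(n_intervals, overlap):
    """Overlapping intervals covering [0, 1] (illustrative formula, not kmapper's)."""
    width = 1.0 / n_intervals
    half = 0.5 * width / (1.0 - overlap)           # widen intervals to create overlap
    centers = width * (np.arange(n_intervals) + 0.5)
    return [(c - half, c + half) for c in centers]

def toy_mapper(X, lens, n_intervals=4, overlap=0.5):
    """Return (nodes, edges): nodes map a name to a set of row positions."""
    nodes = {}
    for i, (lo, hi) in enumerate(interval_cover(n_intervals, overlap)):
        members = np.flatnonzero((lens >= lo) & (lens <= hi))
        if len(members) < 3:
            continue                               # too few points to cluster
        labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X[members])
        for lab in set(labels) - {-1}:             # -1 marks DBSCAN noise
            nodes[f"cube{i}_cluster{lab}"] = set(members[labels == lab])
    # Two nodes are linked when their clusters share at least one point
    edges = [(a, b) for a, b in itertools.combinations(nodes, 2) if nodes[a] & nodes[b]]
    return nodes, edges

# Hypothetical data: 200 points along a line segment, using the x-coordinate as lens
t = np.linspace(0.0, 1.0, 200)
X = np.column_stack([t, np.zeros_like(t)])
nodes, edges = toy_mapper(X, lens=t)
print(len(nodes), len(edges))
```

On this segment the result is a chain of nodes, one per interval, each linked to its neighbour. That is the expected Mapper summary of a line.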


In [40]:
# Visualize it
html = mapper.visualize(graph, path_html="../docs/MapperCluster2.html", title="Mapper Clustering Algorithm")

# Inline display
# jupyter.display(path_html="../docs/MapperCluster.html")

Wrote visualization to: ../docs/MapperCluster2.html
