## Matching algorithm based on interests and/or profile texts

### Initialize modules and functions

In [1]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

def text_preprocessing(df):
    # removing special characters and stop words from the text
    stop_words_l=stopwords.words('english')
    df['profiles_cleaned']=df.profiles.apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() for w in x.split() if re.sub(r'[^a-zA-Z]',' ',w).lower() not in stop_words_l) )
    # local importance: tf-idf
    tfidfvectoriser=TfidfVectorizer(max_features=64)
    tfidfvectoriser.fit(df.profiles_cleaned)
    tfidf_vectors=tfidfvectoriser.transform(df.profiles_cleaned)
    tfidf_vectors=tfidf_vectors.toarray()
    # tokenize and pad every document to make them of the same size
    tokenizer=Tokenizer()
    tokenizer.fit_on_texts(df.profiles_cleaned)
    tokenized_documents=tokenizer.texts_to_sequences(df.profiles_cleaned)
    tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
    vocab_size=len(tokenizer.word_index)+1
    return df, tfidf_vectors, tokenizer, tfidfvectoriser, tokenized_paded_documents, vocab_size

def most_similar(df, id,similarity_matrix):
    print (f'Profile: {df.iloc[id]["names"]}')
    print ('\n')
    print (f'Similar Profiles using Cosine Similarity')
    similar_ix=np.argsort(similarity_matrix[id])[::-1]
    for ix in similar_ix:
        if ix==id:
            continue
        print('\n')
        print (f'Profile: {df.iloc[ix]["names"]}')
        print (f'Cosine Similarity: {similarity_matrix[id][ix]}')

def get_glove_similarity(df, df_idx, tfidf_vectors, tokenizer, tfidfvectoriser, tokenized_paded_documents, vocab_size):
    # reading Glove word embeddings into a dictionary with "word" as key and values as word vectors
    embeddings_index = dict()
    with open('glove.6B.100d.txt', encoding='utf-8') as file:
        for line in file:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    # creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index. 
    embedding_matrix=np.zeros((vocab_size,100))
    for word,i in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    # build one embedding per document: the tf-idf-weighted average of the GloVe vectors of its words
    document_embeddings=np.zeros((len(tokenized_paded_documents),100))
    words=tfidfvectoriser.get_feature_names_out()
    for i in range(df.shape[0]):
        for j in range(len(words)):
            document_embeddings[i]+=embedding_matrix[tokenizer.word_index[words[j]]]*tfidf_vectors[i][j]     
    document_embeddings=document_embeddings/np.sum(tfidf_vectors,axis=1).reshape(-1,1)
    most_similar(df, df_idx,cosine_similarity(document_embeddings))

Several example profiles are given:

![Example Profiles](profiles.png)

In [2]:
# Create profiles as dataframe
outdoor_anna = ["dog standup paddling hiking"]
gamer_thomas = ["fortnite boardgame tetris poker netflix"]
adventure_alex = ["paragliding kite surfing aethletics skiing mountain sports"]
rock_robin = ["guitar singer drums"]
profiles = [outdoor_anna, gamer_thomas, adventure_alex, rock_robin]
df=pd.DataFrame(profiles,columns=['profiles'])
df["names"] = ["outdoor_anna", "gamer_thomas", "adventure_alex", "rock_robin"]

print(df)


                                            profiles           names
0                        dog standup paddling hiking    outdoor_anna
1            fortnite boardgame tetris poker netflix    gamer_thomas
2  paragliding kite surfing aethletics skiing mou...  adventure_alex
3                                guitar singer drums      rock_robin




A matching score is to be computed for these profiles.
Two profiles have a high matching score when the interests of the two people are related. Interests are related when both are frequently mentioned in the same global context (for example, playing guitar is often mentioned in the same context as singing).
GloVe (Global Vectors) is a pretrained model whose word vectors capture how frequently words occur together.
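
To make "mentioned in the same context" concrete, the following minimal sketch counts how often two words appear within a few tokens of each other. The toy corpus and the `window` parameter are made up for illustration, and the real GloVe training pipeline differs in its details; this only shows the kind of statistic it starts from.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # look only to the right so each pair occurrence is counted once
            for neighbour in tokens[i + 1 : i + 1 + window]:
                if neighbour != word:
                    counts[tuple(sorted((word, neighbour)))] += 1
    return counts

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "she plays guitar and sings",
    "he sings while playing guitar",
    "they went hiking with the dog",
]
counts = cooccurrence_counts(corpus, window=3)
print(counts[("guitar", "sings")])  # words that share a context
print(counts[("dog", "guitar")])    # words that never share a context
```

Word pairs with high counts end up with similar vectors after training, which is what the matching score relies on.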

### Preprocessing


The similarity of the interests is computed mathematically as the cosine similarity. To do so, the interests are first brought into a quantifiable form. This includes:
* Lowercasing and removal of special characters
* Vectorization (words are considered individually and each is assigned a number)
* TF-IDF weighting: each word is weighted by its frequency within a profile and by its frequency across all profiles
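
The steps above can be sketched in plain Python to make the arithmetic visible. This is a simplified illustration, not the `text_preprocessing` function used in this notebook: the helper names `clean` and `tfidf` are hypothetical, and the weighting follows the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalisation, which is what `TfidfVectorizer` applies by default.

```python
import math
import re
from collections import Counter

def clean(text, stop_words=frozenset()):
    """Lowercase, replace non-letters with spaces, drop stop words."""
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [t for t in tokens if t not in stop_words]

def tfidf(documents):
    """Return one {word: weight} dict per tokenised document, L2-normalised."""
    n_docs = len(documents)
    # document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        weights = {
            word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, tf in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({word: w / norm for word, w in weights.items()})
    return vectors

docs = [clean(t) for t in ["Guitar, singer & drums!", "guitar hiking dog"]]
vectors = tfidf(docs)
print(docs[0])
print(vectors[1])
```

In the second toy profile, `guitar` receives a lower weight than `hiking` because it also appears in the first profile.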




In [3]:
# preprocessing
df, tfidf_vectors, tokenizer, tfidfvectoriser, tokenized_paded_documents, vocab_size = text_preprocessing(df) 


#### GloVe embeddings
The pretrained GloVe vectors are loaded from the file glove.6B.100d.txt ([link here](https://nlp.stanford.edu/projects/glove/)). An embedding matrix for the local vocabulary is built from the GloVe vectors. The word vectors are then weighted by the local TF-IDF weight of each word. An artificial example of the word vectors obtained from the GloVe model could look as follows:

<img src="vector_representation2.png" alt="Vector Representation" width="500"/>

Each word is expressed as a vector whose position reflects its semantic similarity to neighboring words.
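
As a minimal sketch of how the word vectors are combined into one vector per profile, the snippet below computes a TF-IDF-weighted average. The 2-dimensional vectors and the weights are invented for illustration (the GloVe file used here has 100 dimensions), and `document_embedding` is a hypothetical helper, not a function from this notebook. Words missing from the embeddings are treated as zero vectors.

```python
def document_embedding(weights, embeddings, dim):
    """TF-IDF-weighted average of word vectors; unknown words count as zero vectors."""
    total = [0.0] * dim
    for word, weight in weights.items():
        vector = embeddings.get(word, [0.0] * dim)
        total = [t + weight * v for t, v in zip(total, vector)]
    weight_sum = sum(weights.values())
    # guard against empty profiles instead of dividing by zero
    return [t / weight_sum for t in total] if weight_sum else total

# Made-up 2-d vectors, for illustration only
toy_embeddings = {"guitar": [1.0, 0.0], "drums": [0.8, 0.2], "hiking": [0.0, 1.0]}
profile_weights = {"guitar": 0.5, "drums": 0.5}
result = document_embedding(profile_weights, toy_embeddings, dim=2)
print(result)
```

The resulting vector lies between the vectors of `guitar` and `drums`, so a profile inherits the position of the words it contains.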

Finally, the cosine similarity is computed for each pair of profiles using the following formula:

<img src="cosine_formula.png" alt="drawing" width="500"/>
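
Assuming the image shows the standard definition, $\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$, it can be written out in a few lines of plain Python. The function name `cosine` and the handling of all-zero vectors are choices made for this sketch; the notebook itself uses `cosine_similarity` from scikit-learn.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

Because the measure depends only on direction, a long profile and a short profile with the same interests still receive a high score.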


In [4]:
get_glove_similarity(df, 0, tfidf_vectors, tokenizer, tfidfvectoriser, tokenized_paded_documents, vocab_size)

Profile: outdoor_anna


Similar Profiles using Cosine Similarity


Profile: adventure_alex
Cosine Similarity: 0.6357523292935934


Profile: gamer_thomas
Cosine Similarity: 0.26693923773546746


Profile: rock_robin
Cosine Similarity: 0.1822169619312551


# Profile text
Matching scores can also be computed from profile texts.

In [5]:
Junior_DS_Caro = ["Junior data scientist with an MSc in Applied mathematics with experience throughout the data processing lifecycle. Programming, data wrangling/data cleaning, and analysis skills. Interested in developing machine learning applications to solve real-world problems."]
Senior_DS_Hanna = ["Passionate about data-based insights and decisions since 2016. I have 6+ years of experience in data-driven research and data consulting, and 2+ of experience in agile project management and cross-functional team work. I hold a PhD in psychology and neuroscience and further applied data science methodologies in the area of mobility behavior. I'm offering a team-oriented, proactive and research mindset. Currently, I'm expanding my skills in cloud services and big data methodologies (e.g. Apache Spark). Since I believe that the world needs more female leaders, I want to contribute to prepare young women for a career in female leadership."]
UX_Designer_Tanja = ["I'm a Munich-based UX Designer & Researcher with a B.Sc. in Applied Psychology and professional training in Acting. Acting equipped me with observational & active listening skills, whereas psychology helped me develop hard skills like critical thinking & statistical analysis, and made me understand the cognitive process behind human behavior even better."]
profiles = [Junior_DS_Caro, Senior_DS_Hanna, UX_Designer_Tanja]
df=pd.DataFrame(profiles,columns=['profiles'])
df["names"] = ["Junior_DS_Caro", "Senior_DS_Hanna", "UX_Designer_Tanja"]
df, tfidf_vectors, tokenizer, tfidfvectoriser, tokenized_paded_documents, vocab_size = text_preprocessing(df) 
get_glove_similarity(df, 0, tfidf_vectors, tokenizer, tfidfvectoriser, tokenized_paded_documents, vocab_size)

Profile: Junior_DS_Caro


Similar Profiles using Cosine Similarity


Profile: Senior_DS_Hanna
Cosine Similarity: 0.9222393739200262


Profile: UX_Designer_Tanja
Cosine Similarity: 0.8539059446766031


# Randomly chosen interests


In [6]:
profil1 = ['Duranguense Ebel Neo Metal 8 Ball Pool Painkiller Swedish Doom Metal Chapati Tahari by ASL Ylands Smelt Mega Man II Act of War Mongolian Hip Hop Aerial Yoga Eggomania Underwater Sports Pollack Ska Argentino TOGG Pandora Poland OAMC Badminton Omaha Indie Japanese Idm Kishu Warhammer: Vermintide 2 Ark Turkish Trap Pop Malmo Indie Breakout Daniel Malchert Chinese Indie Pop Filmi Belgian Indie Rock Talentos Brasileiros Dandie Dinmont Terrier Onzie Earthsiege Peanut Hi no Tori Hououhen Gaou no Bouken Mario Kart Tour Boucheron Day Trading Tsukiuta B-Movie Lifestyle Sonic & Sega All-Stars Racing Villeroy & Boch Serbian Indie']
profil2 = [ 'The Whispered World Dnafight Greyhound No Wave Swiss Black Metal Japanese Concert Band Shirley Temple Black Malian Traditional Dueber-Hampden Chanson Virale Tzatziki Prey the Stars Trivia Crack Korean Phantom Singer Rap Metal David Ramsay Sniper Fury German Metal The Warriors Danish Electropop Jazz Trumpet Jazz Organ Kenyan Traditional Tzadik Rap Nacional Antigo Tänzer Kinect Adventures! Japanese Ska Daniela Villegas OPI Transpop Dutch Death Metal Art Song Inov 8 Cheerleading Stupell Industries Rebecca Minkoff Uk Contemporary R&B Pittsburgh Indie Taur Tekka Maki Simutrans Vgm Instrumental Korean Underground Rap Svensk Progg Turin Indie Musica Valenciana Crypto Horror Drama Soul Blues']
profil3 =  ['Ylati Appalachian Folk Balfolk Chicago House Bloco Croatia Exotica Grand Anglo_Français Blanc et Orange Psychology Of Collecting Cape Breton Folk The Legend of Korra Cut the Rope\xa0Magic Streaker Afro Funk Yemeni Traditional Transport Tycoon Fuse New Ghostbusters II Frutti di scusa Modern Hard Rock Dutch Smoushond Dubstep FIFA Gbvfi Historic Value Indie Rock Extremadura Rolling Sky Wipeout in the Zone Histoire Pour Enfants Star-Trek Bridge Commander Dutch Singer Songwriter Gloria Vanderbilt Australian Hip Hop Hanes Ultimate Folklore Nuevo Argentino BYD Auto Champagne Classic Afrobeat ZEW New Tribe Etnies Perry Ellis Atlanta Metal Austrian Contemporary Classical Gaita Zuliana Car Manufactures Harrier Onex Pitfall!']
profil4  =['Area 51 Garoto Red Dead Revolver Singaporean Punk German Punk Rock Didgeridoo Treeing Feist Peter I Island Flying Heroes J.crew Hottāman no Chitei Tanken Qed London Uk Diy Punk Pegu ModWay Russian Death Metal Musica Maringaense Electronica Argentina Gothic Symphonic Metal Sparkle 3 Genesis Coptic Hymn Saskatchewan Indie Yam Teen Drama Cane Paratore Schweizer Rap Reggae Tico Disco No. 1 City of Heroes Techno Kitten Adventure Madagascar Belarus Rock In Opposition Israeli Pop Basketball Psychedelic Trance Scottish Rock New Ghostbusters II Pop Dance Moravian Folk Mexican Pop One Person Band Burmese Western Americana Sweetcorn Apple Tolkien Metal Genesis Motor 2048 Descent']
profil5 = [ "Drumfunk New Wave Of Thrash Metal Jazmin Sour Kart Racing Irish Accordion Albanian Hip Hop Theremin Brussels Sprouts Realsports Tennis Oc Rap Splatoon Wolfpack Niue Illinois Watch Company Thai Teen Pop Volkstumliche Musik ASUAG Alexander Shorokhoff Belfast Metal Yemeni Pop Bulgarian Hip Hop Rock Kapak Russian Dance Pop Gose Shabad Dunker New Romantic Georgian Pop Aguaymanto Sour Lithuanian Rock Sparkle 2 Evo King Arthur's Gold Latin Ska Squish 'em Codfish Mexican Hardcore Mullet Korean Hardcore Country Solipsynthm Bellini Frame Denim Faroese Folk Eastern Atlántico Sur Enlightenment Film Gujarati Pop Pinoy Singer Songwriter Cutthroats Roller In Line Hockey"]
profiles = [profil1, profil2, profil3, profil4, profil5]
df=pd.DataFrame(profiles,columns=['profiles'])
df["names"] = ["profil1", "profil2", "profil3", "profil4", "profil5"]
df, tfidf_vectors, tokenizer, tfidfvectoriser, tokenized_paded_documents, vocab_size = text_preprocessing(df) 
get_glove_similarity(df, 0, tfidf_vectors, tokenizer, tfidfvectoriser, tokenized_paded_documents, vocab_size)

Profile: profil1


Similar Profiles using Cosine Similarity


Profile: profil4
Cosine Similarity: 0.9360698700961971


Profile: profil5
Cosine Similarity: 0.9355171719469483


Profile: profil3
Cosine Similarity: 0.9109724133547961


Profile: profil2
Cosine Similarity: 0.9089423243865048


## Training on a custom corpus

Advantages:
- Rare words such as Fortnite are better represented
- Bigrams can be trained (for example "Hatha Yoga" or "Mountain Biking")
- Accurate representation of German words and expressions
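
One possible way to train bigrams is to merge known two-word phrases into single tokens before training, so that each phrase receives its own vector. The sketch below assumes a whitespace-tokenised corpus and a hand-picked phrase list; both the `merge_phrases` helper and the phrase set are hypothetical.

```python
def merge_phrases(tokens, phrases):
    """Join known two-word phrases into single tokens, e.g. hatha yoga -> hatha_yoga."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i : i + 2])
        if pair in phrases:
            merged.append("_".join(pair))
            i += 2  # skip both words of the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hand-picked phrase list (assumption for this example)
phrases = {("hatha", "yoga"), ("mountain", "biking")}
merged = merge_phrases("i like hatha yoga and mountain biking".split(), phrases)
print(merged)
```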

Description of the training procedure: https://github.com/stanfordnlp/GloVe

Sources
* https://towardsdatascience.com/calculating-document-similarities-using-bert-and-other-models-b2c1a29c9630
* https://nlp.stanford.edu/pubs/glove.pdf
* https://github.com/stanfordnlp/GloVe
* https://en.wikipedia.org/wiki/GloVe
