# Recommendation system using NLP (English resume)
This documentation is written entirely in English so that it can be reused for other presentations.

Note: (Hello Professor Amini, along with this file I am also sending one or more audio files in which I explain each section separately.)

# Imports

In [1]:
import pandas as pd
import numpy as np
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize


# Dataset imports
In addition to the main dataset (pii_dataset), we need to import English stop words as well. Stop words carry effectively zero weight: they add no real value or meaning to the text, so we remove them. In this case I used the NLTK stop words in addition to my previous method (using CountVectorizer to count the most frequent words in the documents and adding them to the stop words list).

In [2]:
df = pd.read_csv('pii_dataset.csv')
stop_words = pd.read_csv('eng_stp_words')

# Convert the 'words' column into a plain Python list so it can be extended later and used for membership checks in the cleaner

stop_words_list = stop_words['words'].tolist()

df['text'] = df['text'].fillna('')

df = df.astype(str)


pd.set_option('display.max_colwidth', None)

In [3]:
df['text'].head()

0    My name is Aaliyah Popova, and I am a jeweler with 13 years of experience. I remember a very unique and challenging project I had to work on last year. A customer approached me with a precious family heirloom - a diamond necklace that had been passed down through generations. Unfortunately, the necklace was in poor condition, with several loose diamonds and a broken clasp. The customer wanted me to restore it to its former glory, but it was clear that this would be no ordinary repair. Using my specialized tools and techniques, I began the delicate task of dismantling the necklace. Each diamond was carefully removed from its setting, and the damaged clasp was removed. Once the necklace was completely disassembled, I meticulously cleaned each diamond and inspected it for any damage. Fortunately, the diamonds were all in good condition, with no cracks or chips. The next step was to repair the broken clasp. I carefully soldered the broken pieces back together, ensuring that the clasp 

In [4]:
stop_words

Unnamed: 0,words
0,i
1,me
2,my
3,myself
4,we
...,...
122,will
123,just
124,don
125,should


# CountVectorizer
We can find the most frequent words in our dataset using this method and add them to the stop words list. Removing these words gives us cleaner and more accurate results in the end.

**Note:** after a couple of tries I realized that the English dataframe contains many more meaningless words, but not all of them can be removed because that disrupts the Word2Vec model. I could have removed more words, for instance the top 500, but that would have made the model less accurate because some of the words in that range are actually valuable.

In [5]:

# initialize CountVectorizer
vectorizer = CountVectorizer()

# fit and transform the text column
X = vectorizer.fit_transform(df['text'])

# get the feature names
feature_names = vectorizer.get_feature_names_out()

# sum up the occurrences of each word
word_counts = X.sum(axis=0)

# convert the word counts to a 1D array
word_counts_array = np.squeeze(np.asarray(word_counts))

# create a DataFrame with word counts and their corresponding feature names
word_counts_df = pd.DataFrame({'Word': feature_names, 'Count': word_counts_array})

# sort the dataframe by word counts to see the most frequent words
most_common_words = word_counts_df.sort_values(by='Count', ascending=False)

# extract the top 100 most common words from the dataframe
top_common_words = most_common_words.head(100)['Word'].tolist()

# append the top common words to your existing stop words list
stop_words_list.extend(top_common_words)

most_common_words.head(500)



Unnamed: 0,Word,Count
25936,the,73872
6798,and,59331
26163,to,47891
19678,of,36856
19096,my,24767
...,...,...
11282,did,379
7144,areas,377
22561,resilience,376
17198,lead,374


# Text cleaning process
Using the re library, we can remove all special characters from the text column and then filter out the stop words. This cleaner differs from the Persian cleaner presented in the other file: in that project the cleaner ran into issues when removing stop words, so I had to take a different approach. This one, on the other hand, is simpler.

In [6]:
def remove_word_eng(text):
    # Clean the text
    clean = re.sub(r'\s+', ' ', text)
    clean = re.sub(r'[^\w\s]', '', clean)
    
    # Remove stop words
    words = clean.split()
    
    filtered_words = [word for word in words if word.lower() not in stop_words_list]
    
    # Join the filtered words back into a single string
    cleaned_text = ' '.join(filtered_words)
    
    return cleaned_text


df['text'].apply(remove_word_eng)

0    Aaliyah Popova jeweler 13 remember unique challenging last year approached precious family heirloom diamond necklace passed generations Unfortunately necklace poor condition several loose diamonds broken clasp wanted restore former glory clear ordinary repair Using specialized tools techniques began delicate task dismantling necklace diamond carefully removed setting damaged clasp removed necklace completely disassembled meticulously cleaned diamond inspected damage Fortunately diamonds good condition cracks chips next step repair broken clasp careful
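To make the cleaning steps easier to follow, here is a minimal, self-contained sketch of the same idea on a single sentence. It assumes a small hypothetical stop word set (`toy_stop_words`) instead of the full list built above, and it uses a `set` rather than a list, since membership checks on a set do not slow down as the collection grows.

```python
import re

# Hypothetical small stop word set; the notebook builds a much larger list.
toy_stop_words = {"my", "is", "and", "i", "am", "a", "with", "of"}

def clean_text(text, stop_words):
    """Strip punctuation, then drop stop words (case-insensitive lookup)."""
    text = re.sub(r"[^\w\s]", "", text)  # remove special characters
    words = text.split()                 # split() also collapses repeated whitespace
    # Only the lookup is lowercased, so the original casing is kept in the output.
    return " ".join(w for w in words if w.lower() not in stop_words)

sentence = "My name is Aaliyah Popova, and I am a jeweler with 13 years of experience."
print(clean_text(sentence, toy_stop_words))
```

With this toy set the word "name" survives; in the notebook it is removed because the real stop word list is larger.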

# Applying NLTK stemmer to English words
Since we are working with English documents, it makes sense to use a well-established stemmer such as NLTK's. (As with the previous stemmer, there are two kinds of stemming to choose from, and PorterStemmer is the computationally lighter one.)

In [7]:
# Define your function to apply the stemmer to English words
def apply_nltk_stemmer(text):
    # Initialize the stemmer
    stemmer = PorterStemmer()

    # Tokenize the text into words
    words = text.split()
    # Apply stemming to each word using the PorterStemmer
    stemmed_words = [stemmer.stem(word) for word in words]
    # Join the stemmed words back into a single string
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text

df['text'].apply(apply_nltk_stemmer)

0    my name is aaliyah popova, and i am a jewel with 13 year of experience. i rememb a veri uniqu and challeng project i had to work on last year. a custom approach me with a preciou famili heirloom - a diamond necklac that had been pass down through generations. unfortunately, the necklac wa in poor condition, with sever loos diamond and a broken clasp. the custom want me to restor it to it former glory, but it wa clear that thi would be no ordinari repair. use my special tool and techniques, i began the delic task of dism
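The output above shows two things worth isolating: the stemmer lowercases and truncates words, and a word with punctuation attached (for example "generations.") is left unstemmed. The sketch below reproduces both on a handful of words taken from that output; it is an illustration only and not part of the pipeline.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the first resume; their stems match the output above.
words = ["remember", "very", "unique", "challenging", "family"]
stems = [stemmer.stem(w) for w in words]
print(dict(zip(words, stems)))

# A trailing period keeps the suffix rules from matching,
# so the token comes back unchanged apart from lowercasing.
with_punctuation = stemmer.stem("generations.")
without_punctuation = stemmer.stem("generations")
print(with_punctuation, without_punctuation)
```

This is why the cleaner should run before the stemmer, as it does inside the word2vec and tf_idf functions: once punctuation is removed, every token can be reduced to its stem.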

# Word2vec 
Again, the process is the same as in the Persian project. The only parts that change are the cleaner and the stemmer.

**Note:** because we used a richer and denser dataset, we get more accurate results for our input.

In [8]:
def word2vec(column, cleaner, user_input, num_results=10):
    # Preprocess the text column: apply the provided cleaner, then the Porter stemmer
    text = column.apply(lambda x: apply_nltk_stemmer(cleaner(x)))
    
    # Tokenize sentences into words
    tokenized_sentences = [word_tokenize(sentence) for sentence in text]
    
    # Train Word2Vec model on tokenized sentences
    model = Word2Vec(tokenized_sentences, vector_size=150, window=6, min_count=10, workers=4)

    # Initialize lists and dictionary for storing similar words
    sim_list = []
    sim_dic = {}

    # Clean the user input with the same cleaner, then stem it
    user_input = apply_nltk_stemmer(cleaner(user_input))

    # Append the preprocessed user input to the list of similar words
    sim_list.append(user_input)

    # Find similar words for each word in the user input
    for word in user_input.split():
        if len(word) > 3:
            try:
                # Get the most similar words from the Word2Vec model
                similar_words = model.wv.most_similar(word, topn=5)
                for sim_words, relevance in similar_words:
                    if sim_words not in sim_dic:
                        sim_dic[sim_words] = relevance
                    else:
                        sim_dic[sim_words] += relevance

            except KeyError:
                continue

    # Filter similar words based on relevance score
    print(sim_dic)
    sim_dic_copy = sim_dic.copy()

    for key, relevance in sim_dic_copy.items():
        if float(relevance) < 0.7:
            sim_dic.pop(key, None)

    # Append filtered similar words to the list
    for key, relevance in sim_dic.items():
        sim_list.append(key)

    # Join the list of similar words into a single string
    sim_string = ' '.join(sim_list)
    return sim_string
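One detail of the function above is easy to miss: relevance scores are summed when the same candidate is suggested by more than one input word, and the 0.7 threshold is applied to that sum. The sketch below isolates this step without training a model. `toy_neighbours` is a hypothetical lookup table with made-up scores, shaped like the `(word, similarity)` pairs that `model.wv.most_similar` returns.

```python
# Hypothetical neighbour lists with made-up similarity scores.
toy_neighbours = {
    "data": [("dataset", 0.86), ("model", 0.55)],
    "scientist": [("biologist", 0.78), ("model", 0.40)],
}

def expand_query(query_words, neighbours, threshold=0.7):
    """Accumulate scores per candidate, then keep candidates at or above the threshold."""
    scores = {}
    for word in query_words:
        for candidate, similarity in neighbours.get(word, []):
            scores[candidate] = scores.get(candidate, 0.0) + similarity
    kept = [word for word, score in scores.items() if score >= threshold]
    return scores, kept

scores, kept = expand_query(["data", "scientist"], toy_neighbours)
print(scores)
print(kept)
```

Here "model" is kept even though neither of its individual scores reaches 0.7, because the two scores add up to about 0.95. If that is not the intended behaviour, taking the maximum instead of the sum would be one alternative.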


# Results
Using the function above, we notice a huge difference in the relevance scores compared to the other dataset (Persian movies). This shows how much difference it makes when the model has a bigger dataset and more samples to work with.

In [9]:
word2vec(df['text'],remove_word_eng,'a data scientist with alot of experience')

{'dataset': 0.8564954400062561, 'predict': 0.813457727432251, 'statist': 0.7955108284950256, 'model': 0.7820979356765747, 'analyz': 0.7791796326637268, 'institut': 0.8311975002288818, 'biologist': 0.7848685383796692, 'intellig': 0.7843748331069946, 'field': 0.7802547216415405, 'comput': 0.7688971161842346}


'data scientist alot dataset predict statist model analyz institut biologist intellig field comput'

# TF_IDF model
The process is the same as in the Persian project, with minor changes to the variables.

In [10]:
def tf_idf(X_column, user_input, num_results=10):
    # Initialize TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    
    # Preprocess text data in the input column using stemming and word removal
    X = X_column.apply(lambda x: apply_nltk_stemmer(remove_word_eng(x)))
    
    # Compute TF-IDF matrix for the preprocessed text data
    tfidf_matrix = tfidf_vectorizer.fit_transform(X)

    # Remove specified words from user input
    main_words = remove_word_eng(user_input)
    
    # Get similar words using Word2Vec model
    similar_words_w2v = word2vec(X_column, remove_word_eng, main_words)
    
    # Apply stemming to similar words obtained from Word2Vec
    similar_words_w2v = ' '.join([apply_nltk_stemmer(word) for word in similar_words_w2v.split()])

    # Preprocess user input by applying stemming and combining with similar words
    original_inp = remove_word_eng(user_input)
    original_inp =' '.join([apply_nltk_stemmer(word) for word in original_inp.split()])
    original_inp += " "+ similar_words_w2v
    
    # Filter out short words
    original_inp = ' '.join([word for word  in original_inp.split() if len(word) > 3])
    print(original_inp)

    # Transform user input into TF-IDF vector
    user_tfidf = tfidf_vectorizer.transform([original_inp])
    
    # Compute cosine similarity between user input TF-IDF vector and corpus TF-IDF matrix
    similarities = cosine_similarity(user_tfidf, tfidf_matrix).flatten()
    
    # Sort and get indices of the most similar documents
    similar_result_indices = similarities.argsort()[:-num_results - 1:-1]

    # Extract specific columns from the DataFrame
    similar_results = df.iloc[similar_result_indices][['text']] 
    
    # Return the most similar resumes
    return similar_results
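The ranking step at the end of the function (TF-IDF vectors, cosine similarity, then `argsort` with a reversed slice) can be checked on a small example. This is a minimal sketch under stated assumptions: `toy_docs` and `rank_documents` are hypothetical names, and the Word2Vec expansion and stemming are left out so that only the ranking logic remains.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the preprocessed text column.
toy_docs = [
    "jeweler repaired diamond necklace",
    "data scientist built predictive model",
    "biologist collected climate data",
]

def rank_documents(docs, query, num_results=2):
    """Return (row index, cosine similarity) pairs, best match first."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(query_vector, doc_matrix).flatten()
    # argsort is ascending; the reversed slice takes the last num_results entries.
    top = similarities.argsort()[:-num_results - 1:-1]
    return [(int(i), float(similarities[i])) for i in top]

ranking = rank_documents(toy_docs, "data scientist")
print(ranking)
```

The document containing both query words ranks first, and the one sharing only "data" ranks second.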


# Note
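You might notice that the original input is repeated twice in the printed query (data scientist alot data scientist alot ...). This happens because the word2vec function returns the input followed by the similar words, and tf_idf then places the cleaned input in front of that string again. As far as I know, this makes the model give more weight to the original text, so the similar words do not take priority over it.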
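The effect of the repetition can be measured directly. The sketch below is an illustration on a hypothetical three-document corpus (`toy_docs`), not on the real dataset: it compares the cosine similarity of a document that matches the original words when those words appear once and when they appear twice in the query.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus of already-stemmed text.
toy_docs = ["data scientist", "dataset predict model", "biologist institut"]
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(toy_docs)

def similarity_to_first_doc(query):
    """Cosine similarity between the query and the first document."""
    query_vector = vectorizer.transform([query])
    return float(cosine_similarity(query_vector, doc_matrix)[0, 0])

# Original words once, followed by two expansion words.
once = similarity_to_first_doc("data scientist dataset predict")
# Original words twice, followed by the same expansion words.
twice = similarity_to_first_doc("data scientist data scientist dataset predict")
print(once, twice)
```

Repeating the original words doubles their term frequency in the query vector, so after normalization they account for a larger share of it, and the document that matches them scores higher.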

# Results
To check how accurate the results are for the user input, press CTRL+F and search for the input words in the text shown at the end.

In [11]:
tf_idf(df['text'],'a data scientist with alot of experience')

{'dataset': 0.853718638420105, 'predict': 0.837277352809906, 'simul': 0.8302186131477356, 'preprocess': 0.8185425996780396, 'statist': 0.8031793832778931, 'institut': 0.8151922225952148, 'biologist': 0.8009185194969177, 'intellig': 0.7957758903503418, 'comput': 0.7924522757530212, 'geologist': 0.7902709245681763}
data scientist alot data scientist alot dataset predict simul preprocess statist institut biologist intellig comput geologist


Unnamed: 0,text
817,"Hello, my name is Xiang Ivanova, and I'm a data scientist working at a leading technology company. Recently, I was tasked with a complex problem that required me to utilize my analytical skills and expertise in data analysis. The challenge involved understanding and resolving inconsistencies in a large dataset that contained information about customer transactions. Using advanced statistical techniques, I was able to identify patterns and anomalies within the data, leading me to discover errors in the data collection process. To address the issue, I worked closely with the data engineering team to implement corrective measures and ensure the accuracy of future data collection. This involved updating and refining the data validation procedures, improving data quality checks, and introducing automated processes to flag and correct any potential inconsistencies. As a result of these efforts, we were able to significantly reduce the number of errors in the dataset, leading to improved data integrity and enhanced decision-making capabilities for the business. Throughout the process, I communicated regularly with stakeholders, including data analysts, business leaders, and IT professionals, to ensure that my findings and recommendations were clearly understood and aligned with the company's objectives. My email address is xiangivanova@msn.org, and my office address is 1311 Durrett Lane. If you have any further questions or would like to discuss data science or analytics, feel free to reach out to me. I am always eager to share my knowledge and insights with fellow professionals and contribute to the advancement of data-driven decision-making."
3476,"In my role as a data scientist at [Company Name], I embarked on a project that involved developing a predictive model to optimize the company's marketing campaigns. Leveraging my 17 years of experience in the field, I began by gathering and analyzing data from various sources, including customer surveys, social media platforms, and website interactions. To ensure data accuracy and completeness, I employed a rigorous data cleaning and preprocessing methodology. Once the data was prepared, I selected and applied appropriate machine learning algorithms to identify patterns and relationships within the data. This involved experimenting with different algorithms, tuning hyperparameters, and evaluating model performance using various metrics. I discovered that a gradient boosting algorithm yielded the best results. This algorithm was able to capture complex interactions between variables and make accurate predictions. To further improve the model's performance, I utilized feature engineering techniques to extract additional insights from the data. This resulted in a significant improvement in model accuracy. The culmination of my efforts was a robust and reliable predictive model that enabled the company to target its marketing campaigns more effectively. By identifying customers who were most likely to respond to specific marketing messages, the company was able to increase its conversion rates and overall marketing ROI. Throughout the project, I maintained open communication with stakeholders, providing regular updates on my progress and seeking feedback. I also ensured that the model was properly documented and easily accessible to the marketing team for ongoing use. In addition to my work as a data scientist, I am an avid traveler and enjoy exploring new cultures. I am passionate about photography and capturing the beauty of the world around me. 
If you would like to discuss this project further, please feel free to contact me at +86 19931 4693 or jaimitsubishi@yahoo.gov. I am also available for freelance work and collaborations. [Optional Additional Information] Name: Jai Mitsubishi Address: 711 Tatem Street Hobby: Related"
1291,"As a data scientist, I frequently encounter intricate puzzles that call for creative solutions. One notable challenge I solved involved developing a predictive model for consumer behavior in the retail industry. By leveraging vast datasets and utilizing machine learning techniques, I designed an algorithm that could accurately forecast customer purchasing patterns, leading to optimized inventory management and increased sales. My name is Ming Liu, and I take pride in my work, which has earned me recognition in the field. I can be reached via email at mingliu8741@gmail.gov. I am currently based at 292 Morgan Mountain Suite 350, where I continue to explore the fascinating world of data science. The predictive model I crafted analyzed historical sales data, customer demographics, and market trends to generate valuable insights into consumer behavior. This empowered retailers to tailor their marketing strategies and product offerings to specific customer segments, resulting in enhanced customer satisfaction and loyalty. I am dedicated to pushing the boundaries of data science and unlocking its potential to drive innovation and growth. The successful implementation of this predictive model is a testament to the transformative power of data-driven decision-making and my commitment to delivering tangible results."
1767,"Hi, I'm Hiroko Neumann, a data scientist with a passion for solving complex problems and extracting meaningful insights from data. One particular project I'm proud of involved utilizing machine learning algorithms to predict customer churn for a major telecommunications company. The challenge was to analyze vast amounts of customer data to identify patterns and trends that could help the company proactively identify customers at risk of canceling their service. I spent countless hours gathering, cleaning, and preprocessing data from various sources, including customer demographics, usage history, billing information, and customer satisfaction surveys. Using a combination of supervised and unsupervised learning techniques, I developed a predictive model that accurately identified customers with a high probability of churning. This model leveraged features such as customer tenure, call patterns, data usage, and payment history to provide valuable insights into customer behavior. Armed with these insights, the company was able to implement targeted retention strategies, including personalized offers, improved customer service, and tailored marketing campaigns. As a result, they experienced a significant reduction in customer churn, leading to increased customer satisfaction, improved brand loyalty, and ultimately, higher revenue. Through this project, I demonstrated my expertise in data science and my ability to make a real impact on a business by leveraging data-driven solutions. It's a rewarding feeling to know that my work has helped the company retain valuable customers and optimize their marketing efforts. Feel free to reach me at hirokoneumann6387@msn.com or 1309 Columbia Road Northwest if you have any questions or if you're interested in collaborating on data science projects."
75,"As Mary Dos Santos, I'm proud of the work I've done as a computer scientist. One memorable project I tackled was developing a comprehensive software system to analyze and visualize complex data. The challenge was to create an intuitive interface that allowed users to seamlessly navigate vast datasets and extract meaningful insights. Leveraging my expertise in data mining and machine learning, I designed a solution that utilized advanced algorithms to uncover hidden patterns and correlations within the data. The interactive dashboard I developed enabled users to explore data from multiple perspectives, generating real-time visualizations that made complex information easily comprehensible. The successful implementation of this system at a major financial institution resulted in improved decision-making processes and significant cost savings. It was a rewarding experience to witness the impact of my work in driving tangible business outcomes. Outside of my professional endeavors, I enjoy spending time with my family at our home in 517 Glenpark Drive, indulging in culinary adventures, and exploring the outdoors through hiking and camping. You can reach me via email at mary.dos santos@outlook.edu if you'd like to connect."
573,"In my role as a data scientist, I encounter numerous challenges that require analytical thinking and technical expertise. One particular case that stands out is when I was tasked with analyzing a large dataset of customer feedback to identify key pain points and areas for improvement in our company's services. Using advanced statistical techniques, natural language processing, and machine learning algorithms, I was able to uncover valuable insights hidden within the vast amount of unstructured data. My findings revealed patterns and correlations that led to the identification of specific aspects of our services that needed attention. This analysis enabled our team to make data-driven decisions to improve customer satisfaction and enhance overall service quality. As I delved deeper into the analysis, I stumbled upon an intriguing anomaly within the dataset. A particular product consistently received negative feedback, but the reasons for dissatisfaction were unclear. To unravel this mystery, I employed a combination of qualitative and quantitative methods. I conducted in-depth interviews with customers who had expressed dissatisfaction to gather their firsthand accounts of their experiences. Additionally, I analyzed usage patterns, technical logs, and support tickets related to the product to identify potential root causes of the issues. Through this comprehensive approach, I was able to pinpoint a specific design flaw that was causing the negative feedback. This discovery empowered our engineering team to rectify the issue promptly, resulting in a significant improvement in customer satisfaction for that particular product. My analytical journey as a data scientist is not confined to a single case study. I am constantly engaged in solving complex business problems and uncovering hidden insights from data. As I continue to explore the realm of data, I strive to leverage my expertise to drive innovation and make a positive impact on the organizations I serve. 
If you wish to reach me, my contact details are as follows: - Name: Thomas Martinez - Email: thomas_martinez@gmail.com - Address: 3115 North Lake Boulevard"
44,"In a project that was particularly intriguing to me, I was tasked with developing a machine learning model to predict customer churn for a telecommunications company. As a seasoned data scientist with 12 years of experience in the field, I was excited to delve into this challenge. To begin, I gathered a vast dataset encompassing various customer-related data points, including demographics, usage patterns, and billing information. With the data securely stored, I embarked on the process of data exploration and preprocessing. I utilized various statistical techniques and visualization tools to identify patterns, trends, and outliers within the dataset. This initial analysis enabled me to gain valuable insights into customer behavior and helped me identify potential predictors of churn. Next, I partitioned the data into training and testing sets to ensure the robustness of the model. I then proceeded to select and tune a suitable machine learning algorithm for the task at hand. Several models were evaluated, including logistic regression, decision trees, and random forests. Hyperparameter optimization was performed to determine the optimal settings for each algorithm. The results were promising. The model achieved a high level of accuracy in predicting customer churn. However, I was not content with merely developing a model; I wanted to understand the underlying factors driving churn. To this end, I employed feature importance analysis to identify the most influential features in the model's predictions. This analysis revealed that factors such as customer tenure, usage patterns, and satisfaction with customer service were significant contributors to churn. With these insights, I collaborated with the marketing and customer service teams to develop targeted interventions aimed at reducing churn. These interventions included personalized offers, improved customer support, and tailored loyalty programs. 
The implemented strategies proved to be effective, resulting in a significant decrease in customer churn and an increase in customer satisfaction. I am proud of the positive impact this project has had on the telecommunications company. It has enabled them to retain valuable customers, optimize their marketing campaigns, and improve their overall customer experience. If you would like to learn more about my work, feel free to visit my webpage at https://rajeshweber.edu or connect with me on Twitter @rweber. You can also reach me at my address: 78511 Rhodes Parks Suite 200."
3177,"My name is Sushila Roche and I work as a computer scientist at XYZ Company. My job is to find solutions for the company's IT needs and improve the efficiency of their systems. One of the challenges I recently solved was the issue of data integration between our company's different software applications. The company had been using multiple applications for various tasks, but the data was not integrated between them. This led to inconsistencies and errors in the data, making it difficult for the company to make informed decisions. I conducted a thorough analysis of the existing systems and identified the challenges in integrating them. I then researched and evaluated various data integration tools and technologies that could help me achieve the desired results. I implemented a data integration solution that utilized a middleware platform to connect the different applications and enable seamless data transfer between them. I also developed a comprehensive data governance policy to ensure the accuracy, consistency, and security of the integrated data. This successful integration resulted in improved data quality, better decision-making, and increased operational efficiency within the company. My contact details are: Sushila Roche, 80 Homestead Street, sushila_roche6411@aol.com."
271,"Hi, I'm Boris Anderson, a data scientist living at 10406 Sunlight Lane. I'm writing to share an interesting problem I solved at work. We were working on a project to improve the efficiency of our customer service department. We had a lot of data on customer interactions, but it was challenging to identify patterns and trends that could help us make improvements. I used a combination of machine learning and statistical analysis to identify key factors that impacted customer satisfaction. I also developed a predictive model that could help us identify customers who were likely to have a negative experience. This allowed us to prioritize these customers and provide them with additional support. The result was a significant improvement in customer satisfaction and a reduction in the number of customer complaints. This led to a more positive customer experience and increased revenue for the company. If you have any questions about my work, please feel free to contact me at boris.anderson@outlook.com."
3957,"Hello, I'm Lucy Weiss. I have worked as a biologist for the last nine years. In my most recent position at the National Institute of Health, I led a project investigating the effects of climate change on local ecosystems. I began by assembling a team of scientists with expertise in ecology, meteorology, and GIS mapping. We then collected data on local plant and animal populations, as well as climate patterns, over a period of several years. One of the challenges we faced was the sheer volume of data we collected. To address this, we developed a database system that allowed us to store and organize the data in a way that made it easy to analyze. We also used statistical software to identify trends and patterns in the data. Our analysis revealed that climate change was having a significant impact on local ecosystems. We found that rising temperatures were leading to changes in plant and animal distributions, as well as increased frequency and severity of extreme weather events. We presented our findings at a scientific conference, and they were published in a peer-reviewed journal. Our work has helped to raise awareness of the impacts of climate change on local ecosystems, and it has informed policy decisions at the local and national levels. In addition to my work as a biologist, I enjoy weaving as a hobby. I also enjoy spending time with my family and friends. If you would like to get in touch with me, you can reach me by phone at 0178 774 3597, by email at lucyweiss1295@hotmail.gov, or by mail at 19404 North 77th Avenue."
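As an alternative to searching the output by hand, the occurrences of each input word can be counted per returned row. This is a hedged sketch: `count_term_hits` is a hypothetical helper, and `toy_results` is a made-up two-row frame that only mimics the shape of the result above (an index plus a `text` column).

```python
import re
import pandas as pd

def count_term_hits(results, terms, column="text"):
    """Count case-insensitive occurrences of each term in every returned row."""
    lowered = results[column].str.lower()
    return pd.DataFrame(
        {term: lowered.str.count(re.escape(term.lower())) for term in terms},
        index=results.index,
    )

# Made-up rows; only the structure matches the real output.
toy_results = pd.DataFrame(
    {"text": ["I'm a data scientist working with data.", "I work as a biologist."]},
    index=[817, 3957],
)
hits = count_term_hits(toy_results, ["data", "scientist"])
print(hits)
```

Rows with zero hits for every input word were retrieved only through the words added by Word2Vec, which makes them the first ones to inspect when judging accuracy.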
