# Text Mining for common words

I produced a `.txt` file containing the job skill keywords for 73 jobs listed on LinkedIn.
The file has 730 lines, with one skill phrase per line.

For example, skills are listed as:

```
Cloud Computing
Learning
Microsoft Azure
Mobile Application Development
```

**To Do**   
- Find the most and least common words.   
- Find the most and least common skill phrases.   
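
Both To Do items come down to counting tokens at two granularities: single words and whole lines. The sketch below shows the idea on a small inline sample rather than the real file. The function name `count_words_and_phrases` and the sample text are illustrative assumptions, not part of the original notebook.

```python
import re
from collections import Counter

def count_words_and_phrases(text):
    """Return (word_counts, phrase_counts) for text with one skill phrase per line."""
    # One phrase per non-empty line, with surrounding whitespace removed
    phrases = [line.strip() for line in text.splitlines() if line.strip()]
    # Lowercased word tokens: runs of letters, digits, and underscores
    words = re.findall(r"\b\w+\b", text.lower())
    return Counter(words), Counter(phrases)

# Hypothetical sample in the same one-phrase-per-line format as the file
sample = "Cloud Computing\nLearning\nMicrosoft Azure\nMachine Learning\nMicrosoft Azure\n"
word_counts, phrase_counts = count_words_and_phrases(sample)

print(word_counts.most_common(2))
print(phrase_counts.most_common(1))
```

Keeping the two counters in one helper means the text only has to be downloaded and parsed once, and each later question becomes a lookup on a `Counter`.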

### The 20 most common words

In [7]:
import urllib.request
import re
from collections import Counter

url = "https://raw.githubusercontent.com/mccurcio/ds_text_mining/main/73_ds_job_skills.txt"

# Download the file contents; the context manager closes the connection afterwards
with urllib.request.urlopen(url) as response:
    data = response.read().decode("utf-8")

# Tokenize the text into words and count the occurrences of each word
words = re.findall(r'\b\w+\b', data.lower())
word_counts = Counter(words)

# Get the 20 most common words
most_common = word_counts.most_common(20)

# Print the results
print("The 20 most common words are:")
for word, count in most_common:
    print(word, count)


The 20 most common words are:
data 165
science 58
analytics 56
analysis 45
language 40
programming 32
skills 28
sql 26
analytical 25
learning 23
python 23
communication 23
computer 21
business 21
visualization 21
management 19
microsoft 17
machine 16
modeling 16
databases 15


### The 20 least common words

In [10]:
import urllib.request
import re
from collections import Counter

url = "https://raw.githubusercontent.com/mccurcio/ds_text_mining/main/73_ds_job_skills.txt"

# Download the file contents; the context manager closes the connection afterwards
with urllib.request.urlopen(url) as response:
    data = response.read().decode("utf-8")

# Tokenize the text into words and count the occurrences of each word
words = re.findall(r'\b\w+\b', data.lower())
word_counts = Counter(words)

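# Get the 20 least common words by reading the ranking backwards from its end.
# most_common() keeps words with equal counts in first-seen order, so among
# equally rare words this picks the ones that first appear latest in the file.
least_common = word_counts.most_common()[:-21:-1]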

# Print the results
print("The 20 least common words are:")
for word, count in least_common:
    print(word, count)



The 20 least common words are:
time 1
principles 1
accounting 1
statutory 1
products 1
improvement 1
collection 1
programs 1
training 1
rpa 1
automation 1
robotic 1
application 1
mobile 1
cloud 1
models 1
ssrs 1
oracle 1
mis 1
actuarial 1
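
Every word in the list above has a count of 1, so a fixed cutoff of 20 may leave out other words that are just as rare. One alternative, sketched here under that assumption, is to return every word that shares the lowest count. The helper name `rarest_words` and the sample data are hypothetical.

```python
from collections import Counter

def rarest_words(word_counts):
    """Return every word that shares the lowest count, sorted alphabetically."""
    if not word_counts:
        return []
    lowest = min(word_counts.values())
    return sorted(word for word, count in word_counts.items() if count == lowest)

# Hypothetical sample: "data" appears 3 times, "science" twice, the rest once
sample_counts = Counter(["data", "data", "science", "rpa", "data", "science", "oracle"])
print(rarest_words(sample_counts))
```

On the real data this would be called as `rarest_words(word_counts)` after building the counter as in the cell above.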


### The 20 most common phrases

In [9]:
import requests
from collections import Counter

# Fetch data from the URL
url = "https://raw.githubusercontent.com/mccurcio/ds_text_mining/main/73_ds_job_skills.txt"
response = requests.get(url)

# Extract the text and split into lines
text = response.text.strip()
lines = text.split("\n")

# Quote each line
quoted_lines = [f'"{line}"' for line in lines]

# Count the occurrences of each quoted phrase
phrase_counts = Counter(quoted_lines)

# Print the 20 most common phrases
for phrase, count in phrase_counts.most_common(20):
    print(f"{phrase}: {count}")


"Data Science": 37
"Data Analytics": 26
"Data Analysis": 24
"Analytical Skills": 23
"SQL": 22
"Python (Programming Language)": 22
"Computer Science": 19
"Communication": 19
"Analytics": 16
"Machine Learning": 15
"Data Visualization": 14
"Statistics": 13
"Natural Language Processing (NLP)": 10
"Databases": 10
"Data Modeling": 9
"Business Intelligence (BI)": 7
"Data Mining": 6
"Data Cleaning": 6
"Dashboard": 6
"Deep Learning": 6


### The 20 least common phrases

In [17]:
import requests
from collections import Counter

# Fetch data from the URL
url = "https://raw.githubusercontent.com/mccurcio/ds_text_mining/main/73_ds_job_skills.txt"
response = requests.get(url)

# Extract the text and split into lines
text = response.text.strip()
lines = text.split("\n")

# Quote each line
quoted_lines = [f'"{line}"' for line in lines]

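# Count the occurrences of each quoted phrase
phrase_counts = Counter(quoted_lines)

# Take the last 20 entries of the ranking, i.e. the 20 least common phrases
least_common_phrases = phrase_counts.most_common()[-20:]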

# Print the least common phrases
print("The 20 least common phrases in the file are:\n")
for phrase, count in least_common_phrases:
    print(f"{phrase}: {count}")


The 20 least common phrases in the file are:

"Actuarial Science": 1
"Management Information Systems (MIS)": 1
"Database Administration": 1
"Oracle Database": 1
"Query Writing": 1
"Reporting Requirements": 1
"SQL Server Reporting Services (SSRS) ": 1
"Data Models": 1
"Cloud Computing": 1
"Learning": 1
"Mobile Application Development": 1
"Robotic Process Automation (RPA)": 1
"Training Programs ": 1
"Data Collection": 1
"Process Improvement": 1
"Quality Management": 1
"SAP Products": 1
"Statutory Accounting Principles (SAP) ": 1
"Microsoft Power Query": 1
"Time Management": 1


### Count the number of unique words

In [11]:
import requests

# Fetch data from the URL
url = "https://raw.githubusercontent.com/mccurcio/ds_text_mining/main/73_ds_job_skills.txt"
response = requests.get(url)

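# Extract the text and split it on whitespace.
# Unlike the regex tokenizer used for the word counts, this keeps case and
# punctuation, so "Learning" and "learning" would count as different words,
# as would "(NLP)" and "NLP".
text = response.text.strip()
words = text.split()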

# Count the number of unique words
unique_words = set(words)
num_unique_words = len(unique_words)

# Print the result
print(f"There are {num_unique_words} unique words in the file.")


There are 350 unique words in the file.
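
The figure above comes from splitting on whitespace, while the word-frequency cells lowercase the text and tokenize with a regular expression, so the two approaches can disagree on what counts as a unique word. The sketch below compares them on a made-up sample. The function names and the sample text are assumptions for illustration, and the numbers it prints say nothing about the real file.

```python
import re

def unique_whitespace_tokens(text):
    """Unique tokens from a plain whitespace split (case and punctuation kept)."""
    return set(text.split())

def unique_regex_tokens(text):
    """Unique lowercased tokens matched by the \\b\\w+\\b pattern."""
    return set(re.findall(r"\b\w+\b", text.lower()))

# Hypothetical sample with a case variant and a parenthesised acronym
sample = "Data Science\nData science\nNatural Language Processing (NLP)\nNLP\n"

print(len(unique_whitespace_tokens(sample)))
print(len(unique_regex_tokens(sample)))
```

Here the whitespace split treats `Science` and `science`, and `(NLP)` and `NLP`, as distinct tokens, whereas the regex version merges each pair.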


### The number of unique phrases

In [12]:
import requests

# Fetch data from the URL
url = "https://raw.githubusercontent.com/mccurcio/ds_text_mining/main/73_ds_job_skills.txt"
response = requests.get(url)

# Extract the text and split into lines
text = response.text.strip()
lines = text.split("\n")

# Quote each line
quoted_lines = [f'"{line}"' for line in lines]

# Count the number of unique quoted phrases
unique_quoted_lines = set(quoted_lines)
num_unique_quoted_lines = len(unique_quoted_lines)

# Print the result
print(f"There are {num_unique_quoted_lines} unique quoted phrases in the file.")


There are 296 unique quoted phrases in the file.
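
Some phrases in the listings carry a trailing space inside the quotes (for example `"Training Programs "`), which means a phrase with and without that space would be counted as two different entries. A minimal sketch of normalizing each line before counting follows. The helper name `count_phrases` and the sample lines are hypothetical, and the real file may or may not contain such near-duplicates.

```python
from collections import Counter

def count_phrases(lines):
    """Count phrases after stripping surrounding whitespace and dropping empty lines."""
    cleaned = (line.strip() for line in lines)
    return Counter(phrase for phrase in cleaned if phrase)

# Hypothetical lines: the same phrase with and without a trailing space
raw_lines = ["Tableau ", "Tableau", "Training Programs ", ""]

raw_counts = Counter(raw_lines)
clean_counts = count_phrases(raw_lines)

print(len(raw_counts), len(clean_counts))
print(clean_counts["Tableau"])
```

Applied to the notebook's `lines` list, `count_phrases(lines)` would replace `Counter(quoted_lines)`, and the unique-phrase total could change if any such variants exist.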


### Print all the phrases

In [19]:
import requests
from collections import Counter

# Fetch data from the URL
url = "https://raw.githubusercontent.com/mccurcio/ds_text_mining/main/73_ds_job_skills.txt"
response = requests.get(url)

# Extract the text and split into lines
text = response.text.strip()
lines = text.split("\n")

# Quote each line
quoted_lines = [f'"{line}"' for line in lines]

# Count the occurrences of each quoted phrase
phrase_counts = Counter(quoted_lines)

# Print all the phrases sorted by their counts
print("All the phrases in the file sorted by their counts are:")
for phrase, count in phrase_counts.most_common():
    print(f"{phrase}: {count}")


All the phrases in the file sorted by their counts are:
"Data Science": 37
"Data Analytics": 26
"Data Analysis": 24
"Analytical Skills": 23
"SQL": 22
"Python (Programming Language)": 22
"Computer Science": 19
"Communication": 19
"Analytics": 16
"Machine Learning": 15
"Data Visualization": 14
"Statistics": 13
"Natural Language Processing (NLP)": 10
"Databases": 10
"Data Modeling": 9
"Business Intelligence (BI)": 7
"Data Mining": 6
"Data Cleaning": 6
"Dashboard": 6
"Deep Learning": 6
"Tableau ": 6
"Pattern Recognition": 5
"Predictive Analytics ": 5
"Predictive Modeling": 5
"R (Programming Language)": 5
"Visualization ": 5
"Problem Solving": 5
"A/B Testing": 4
"Mathematics": 4
"Statistical Analysis": 4
"Business Analysis": 4
"Database Design": 4
"Extract, Transform, Load (ETL)": 4
"Project Management": 4
"Critical Thinking": 4
"Microsoft Excel": 4
"Artificial Intelligence (AI)": 3
"Amazon Web Services (AWS)": 3
"Azure Databricks": 3
"Microsoft Azure": 3
"Experimental Design": 3
"SAS (Soft