## Introduction

People have been saying for a while that the IT sector is growing as technology advances. So in this notebook I set out to find how much the industry grew from 2004 to 2015.

I got the dataset from Kaggle. It is a rich dataset of job posts from 2004 to 2015, and it was funded by an American University of Armenia research grant in 2015. Because of that origin, most of the posts are for jobs based in Armenia.

[This is the link to the dataset](https://www.kaggle.com/madhab/jobposts)

In [176]:
# import libraries
import pandas as pd
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# tokenizer that keeps only runs of word characters (drops punctuation)
tokenizer = RegexpTokenizer(r'\w+')
# English stop words (requires the NLTK "stopwords" corpus to be downloaded)
stop_words = set(stopwords.words('english'))

In [177]:
# Importing data
data = pd.read_csv("./job_posts_data.csv")
# print the imported data
data


Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18996,Technolinguistics NGO\r\n\r\n\r\nTITLE: Senio...,"Dec 28, 2015",Senior Creative UX/ UI Designer,Technolinguistics NGO,,Full-time,,,,Long-term,...,Competitive,"To apply for this position, please send your\r...",29 December 2015,28 January 2016,,As a company Technolinguistics has a mandate t...,,2015,12,False
18997,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Category Development Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,...,,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18998,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Operational Marketing Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,...,,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18999,San Lazzaro LLC\r\n\r\n\r\nTITLE: Head of O...,"Dec 30, 2015",Head of Online Sales Department,San Lazzaro LLC,,,,,,Long-term,...,Highly competitive,Interested candidates can send their CVs to:\r...,30 December 2015,29 January 2016,,San Lazzaro LLC works with several internation...,,2015,12,False


### Columns to drop
---
- jobpost 
- AnnouncementCode
- StartDate
- Duration
- JobDescription
- JobRequirment
- ApplicationP
- Notes
- AboutC
- Attach
- OpeningDate
- Deadline
- Salary
- Eligibility
---

Here, we drop these fourteen columns to clean up the data and make it easier to work with. None of the dropped columns has any impact on our aim.

jobpost is the original job post, that is, the full text displayed on the posting website. It includes the company name, job description, job requirements, job responsibilities and so on. We do not need this column because the other columns already hold what we need for our aim.

AnnouncementCode is some sort of code used by the people who shared the data. Most values in this column are empty (NaN), so it does not provide any useful information.

StartDate is the date on which the hired person would start working. Most values in this column are empty; the rest are either dates or phrases such as "Immediately" or "As soon as possible".

Duration is the duration of the employment.

JobDescription is the description of the job.

JobRequirment holds the job requirements.

ApplicationP is the procedure for applying for the job.

Salary is the salary offered.

Notes are additional notes that the company wants the applicant to know.

AboutC is further information about the company.

Attach lists the attachments that the company expects from the applicant.

OpeningDate is the date on which the job post was opened.

Deadline is the date on which the job post closes.

Eligibility describes who is eligible to apply.
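
Several of the descriptions above say that a column is mostly empty. A quick way to back that up before dropping anything is to measure the share of missing values per column. The sketch below uses a tiny made-up frame (`sample`) rather than the real dataset, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical miniature stand-in for the job posts data (not real rows).
sample = pd.DataFrame({
    "Title": ["Software Developer", "Accountant", "Country Coordinator", "BCC Specialist"],
    "AnnouncementCode": [None, None, None, "A-1"],
    "StartDate": [None, "ASAP", None, None],
})

# isna() marks missing cells; the column-wise mean is the fraction missing.
missing_share = sample.isna().mean().sort_values(ascending=False) * 100
print(missing_share.round(1))
```

Columns near 100% carry almost no information and are natural candidates for dropping.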

In [178]:
# list of columns that are to be dropped
columns_to_drop = [
    "jobpost", 
    "AnnouncementCode", 
    "StartDate", 
    "Duration",  
    "JobDescription", 
    "JobRequirment",
    "Salary",
    "ApplicationP", 
    "Notes",
    "AboutC",
    "Attach",
    "OpeningDate",
    "Deadline",
    "Eligibility"
    ]

# dropping columns
filtered_data = data.drop(columns=columns_to_drop)
# print filtered data
filtered_data


Unnamed: 0,date,Title,Company,Term,Audience,Location,RequiredQual,Year,Month,IT
0,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,"Yerevan, Armenia","To perform this job successfully, an\r\nindivi...",2004,1,False
1,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,"IREX Armenia Main Office; Yerevan, Armenia \r\...",- Bachelor's Degree; Master's is preferred;\r\...,2004,1,False
2,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,"Yerevan, Armenia","- Degree in environmentally related field, or ...",2004,1,False
3,"Jan 7, 2004",BCC Specialist,Manoff Group,,,"Manila, Philippines","- Advanced degree in public health, social sci...",2004,1,False
4,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,"Yerevan, Armenia",- University degree; economical background is ...,2004,1,True
...,...,...,...,...,...,...,...,...,...,...
18996,"Dec 28, 2015",Senior Creative UX/ UI Designer,Technolinguistics NGO,Full-time,,"Yerevan, Armenia",- At least 5 years of experience in Interface/...,2015,12,False
18997,"Dec 30, 2015",Category Development Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",Full-time,,"Yerevan, Armenia","- University degree, ideally business related;...",2015,12,False
18998,"Dec 30, 2015",Operational Marketing Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",Full-time,,"Yerevan, Armenia","- Degree in Business, Marketing or a related f...",2015,12,False
18999,"Dec 30, 2015",Head of Online Sales Department,San Lazzaro LLC,,,"Yerevan, Armenia",- At least 1 year of experience in online sale...,2015,12,False


In [179]:
# get all years in which the jobs were posted
years = pd.unique(filtered_data["Year"])
# print all years
years

array([2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
       2015])

In [180]:
"""
    a function that returns a column from the data grouped by a particular column
"""
def getColumnByGroup (data, group_by, column_name):
    return data.groupby(group_by)[column_name]

In [181]:
""" 
    a function that takes in data in the form of a dataframe and groups them by 
    the year they were posted on and then counts number of IT jobs and non IT jobs 
"""

def getITJobCounts(data) :
    # group by year and get the IT column
    grouped_by_year = getColumnByGroup(data, "Year", "IT")
    # list to store all dictionaries with counted data
    job_counts = []
    # iterate over all grouped data; each group is a (year, IT column) pair
    for year, it_column in grouped_by_year:
        # the IT column is True if the job is an IT job and False otherwise,
        # so value_counts() gives the number of posts for each value
        trueFalseCounts = it_column.value_counts()
        # look the counts up by label; a missing label means no such posts
        it_jobs = trueFalseCounts.get(True, 0)
        non_it_jobs = trueFalseCounts.get(False, 0)
        # collect the counts for this year in a dictionary
        # eg {year, ITJobs, NonITJobs, totalJobs}
        subCounter = {
            "year": year,
            "ITJobs": it_jobs,
            "NonITJobs": non_it_jobs,
            "totalJobs": it_jobs + non_it_jobs
        }
        # add the dictionary to the list
        job_counts.append(subCounter)
    # return the list of dictionaries
    return job_counts

In [182]:
"""
    a function to calculate the percentage by getting the number of jobs and the 
    total number of jobs
    basic formula used here is (jobs / total_number_of_jobs) * 100
"""
def calculatePercentage (num_of_jobs, total_num_jobs):
    # if total number of jobs is 0, meaning that it is an invalid number
    if total_num_jobs == 0:
        print("Cannot calculate percent where the total number of jobs is 0")
        return
    # divide number of jobs by total number of jobs
    division = num_of_jobs / total_num_jobs
    # multiply the divided value by 100 to get percentage and return it
    return division * 100

In [183]:
"""
    a function that calculates the percentage difference between two values
    basic formula used here is ((final - initial) / initial) * 100
"""
def percentageIncreaseDecrease (initial, final):
    percentage_difference = (final - initial) / initial
    percent = percentage_difference * 100
    # Print results based on the output
    if percent > 0:
        print("increase in percentage =", percent, "%")
    elif percent == 0:
        print ("There was no increase or decrease")
    else:
        print("decrease in percentage =", abs(percent), "%")
    # return the signed value so the caller can store it
    return percent

In [184]:
# get job counts from the getITJobCounts function
job_data_by_year = getITJobCounts(filtered_data)

# create a list that stores the year with the percentage of IT jobs posted
percentage_list = []

# iterate over job counts
for jobs in job_data_by_year:
    # add a new dictionary to percentage list that contains the year and 
    # the percentage of IT jobs per year
    percentage_list.append({
        "year": jobs["year"],
        "ITJobPercentage" : calculatePercentage(jobs["ITJobs"], jobs['totalJobs'])
    })

In [185]:
# print the list of years with percentages
percentage_list

[{'year': 2004, 'ITJobPercentage': 16.57142857142857},
 {'year': 2005, 'ITJobPercentage': 18.27768014059754},
 {'year': 2006, 'ITJobPercentage': 20.161290322580644},
 {'year': 2007, 'ITJobPercentage': 20.611183355006503},
 {'year': 2008, 'ITJobPercentage': 18.711484593837536},
 {'year': 2009, 'ITJobPercentage': 13.014273719563393},
 {'year': 2010, 'ITJobPercentage': 14.824619457313037},
 {'year': 2011, 'ITJobPercentage': 19.269298762522098},
 {'year': 2012, 'ITJobPercentage': 22.010237319683572},
 {'year': 2013, 'ITJobPercentage': 18.81533101045296},
 {'year': 2014, 'ITJobPercentage': 23.34846192637418},
 {'year': 2015, 'ITJobPercentage': 25.38576406172225}]
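
Because `IT` is a boolean column, the same yearly shares can also be computed without an explicit loop: the mean of a boolean column is the fraction of `True` values. A minimal sketch on made-up rows (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Hypothetical rows: one entry per job post, with its year and IT flag.
toy = pd.DataFrame({
    "Year": [2004, 2004, 2004, 2004, 2005, 2005],
    "IT":   [True, False, False, False, True, True],
})

# Mean of a boolean column = share of True values; scale to percent.
it_share = toy.groupby("Year")["IT"].mean() * 100
print(it_share)
```

A year without any IT posts would simply get a share of 0 in this form.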

In [186]:
# calculate the percentage difference in the first and the last year in this case 
# the first year is 2004 and the last year is 2015
diff_percent = percentageIncreaseDecrease(
    percentage_list[0]["ITJobPercentage"], 
    percentage_list[-1]["ITJobPercentage"]
)

increase in percentage = 53.18995554487566 %


That was the share of IT jobs: it rose from about 16.6% of all posts in 2004 to about 25.4% in 2015, a relative increase of roughly 53% over 11 years. What about non-IT jobs? Did the share of all other industries combined increase as well, and did it change more than the IT share? Let's find out.

In [187]:
# a list to store dictionaries each containing a year and a percentage of non IT jobs
non_IT_percentage_list = []
# iterate over all years with IT and non IT job counts
# (the loop variable is named jobs so it does not overwrite the data frame)
for jobs in job_data_by_year:
    # add a new dictionary to the list that contains the year and 
    # the percentage of non IT jobs in that year
    non_IT_percentage_list.append({
        "year": jobs["year"],
        "percentage": calculatePercentage(
            jobs["NonITJobs"],
            jobs['totalJobs']
            )
    })

# print the list
non_IT_percentage_list

[{'year': 2004, 'percentage': 83.42857142857143},
 {'year': 2005, 'percentage': 81.72231985940246},
 {'year': 2006, 'percentage': 79.83870967741935},
 {'year': 2007, 'percentage': 79.3888166449935},
 {'year': 2008, 'percentage': 81.28851540616246},
 {'year': 2009, 'percentage': 86.98572628043661},
 {'year': 2010, 'percentage': 85.17538054268697},
 {'year': 2011, 'percentage': 80.7307012374779},
 {'year': 2012, 'percentage': 77.98976268031642},
 {'year': 2013, 'percentage': 81.18466898954703},
 {'year': 2014, 'percentage': 76.65153807362583},
 {'year': 2015, 'percentage': 74.61423593827774}]

In [188]:
# calculate the percentage difference
non_IT_percen_diff = percentageIncreaseDecrease(
    non_IT_percentage_list[0]["percentage"],
    non_IT_percentage_list[-1]["percentage"]
)

decrease in percentage = 10.56512815617394 %
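
The two results above are relative changes, which is why they differ in size even though the IT and non-IT shares always add up to 100%. Measured in percentage points, the two shares move by the same amount in opposite directions. A short sketch using the 2004 and 2015 values printed earlier (the helper name `relative_change` is mine):

```python
# Shares of IT job posts in 2004 and 2015, copied from the output above.
it_2004, it_2015 = 16.57142857142857, 25.38576406172225
# Non-IT shares are the complements.
non_it_2004, non_it_2015 = 100 - it_2004, 100 - it_2015

def relative_change(initial, final):
    """Relative change from initial to final, in percent."""
    return (final - initial) / initial * 100

# Absolute change in percentage points: same size, opposite sign.
print("IT change (points):    ", round(it_2015 - it_2004, 2))
print("non-IT change (points):", round(non_it_2015 - non_it_2004, 2))
# Relative change depends on the starting value, so the sizes differ.
print("IT relative change:    ", round(relative_change(it_2004, it_2015), 2), "%")
print("non-IT relative change:", round(relative_change(non_it_2004, non_it_2015), 2), "%")
```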


Judging from the changes shown above, it seems fair to expect the IT share of job posts to keep growing in the coming years, although twelve years of posts from a single source cannot guarantee it.
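
One rough way to look at the trend behind this expectation is to fit a straight line through the twelve yearly IT shares. This is only a descriptive sketch: the values are rounded from the output above, and the choice of a linear fit is my assumption, not a forecast.

```python
# Yearly IT shares in percent (2004-2015), rounded from the output above.
years = list(range(2004, 2016))
it_share = [16.57, 18.28, 20.16, 20.61, 18.71, 13.01,
            14.82, 19.27, 22.01, 18.82, 23.35, 25.39]

# Ordinary least-squares slope and intercept, written out by hand.
mean_x = sum(years) / len(years)
mean_y = sum(it_share) / len(it_share)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, it_share))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x

print("average change per year (percentage points):", round(slope, 2))
```

The dip in 2009 and 2010 shows that the share does not rise every year, so the slope is best read as a long-run average.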

So much for the percentages. What about job titles? Which words occur most often in them?


In [189]:
# get the unique job titles, separated by whether they are from the IT sector or not
job_titles = filtered_data.groupby("IT")["Title"].unique()
# store the non IT job titles in a variable (looked up by the label False)
non_IT_job_titles = job_titles.loc[False]
# store the IT job titles in a variable (looked up by the label True)
it_job_titles = job_titles.loc[True]


In [190]:
# collect every word from the IT job titles to find the most frequent ones
title_splits_IT_jobs = []

# iterate over all the titles and tokenize them 
for title in it_job_titles:
    words_list = tokenizer.tokenize(title)
    for word in words_list:
        title_splits_IT_jobs.append(word)
# use FreqDist to count the words, skipping English stop words
fdist = FreqDist([ word for word in title_splits_IT_jobs if word not in stop_words ])
# print the most frequent word
print ("most frequent word in the IT job titles is", fdist.max()) 
# show the 20 most common words
fdist.most_common(20)

most frequent word in the IT job titles is Developer


[('Developer', 529),
 ('Software', 331),
 ('Senior', 255),
 ('Engineer', 230),
 ('C', 174),
 ('Specialist', 147),
 ('Administrator', 91),
 ('Web', 85),
 ('Java', 81),
 ('IT', 79),
 ('Programmer', 70),
 ('NET', 69),
 ('Database', 68),
 ('Department', 57),
 ('Manager', 56),
 ('Designer', 55),
 ('Unit', 50),
 ('PHP', 49),
 ('Network', 46),
 ('ASP', 46)]

In [191]:
# split all words to find the most occured and most common words
title_splits_non_IT_jobs = []

# iterate over all the titles and tokenize them 
for title in non_IT_job_titles:
    words_list = tokenizer.tokenize(str(title))
    for word in words_list:
        title_splits_non_IT_jobs.append(word)

# use FreqDist to find out the most occured words 
fdist_non_it = FreqDist([ word for word in title_splits_non_IT_jobs if word not in stop_words ])
# print the most occured title
print ("most occured word in the titles column is", fdist_non_it.max()) 
# print the 20 most common titles 
fdist_non_it.most_common(20)

most occured word in the titles column is Specialist


[('Specialist', 1064),
 ('Manager', 968),
 ('Assistant', 537),
 ('Head', 461),
 ('Department', 436),
 ('Engineer', 398),
 ('Development', 387),
 ('Officer', 386),
 ('Senior', 375),
 ('Coordinator', 341),
 ('Expert', 317),
 ('Sales', 316),
 ('Project', 304),
 ('Marketing', 266),
 ('Director', 247),
 ('Program', 245),
 ('Division', 237),
 ('Consultant', 221),
 ('Financial', 220),
 ('Service', 205)]

As we can see, the most frequent words in IT job titles are Developer and Software, whereas the most frequent words in non-IT job titles are Specialist and Manager.
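
The word counts above come from NLTK's `FreqDist`. The same idea can be sketched with only the standard library, which makes the steps explicit: split each title into word tokens, drop stop words, and count what is left. The titles and the small stop-word set below are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical job titles (not taken from the dataset).
titles = ["Senior Software Developer", "Software Developer",
          "Head of IT Department", "Web Developer"]
# Tiny illustrative stop-word set; NLTK's English list is much longer.
stop_words = {"of", "the", "and"}

# \w+ mirrors the RegexpTokenizer pattern used in the notebook.
words = [w for title in titles for w in re.findall(r"\w+", title)
         if w not in stop_words]
counts = Counter(words)
print(counts.most_common(3))
```

As in the notebook, the stop-word check is case-sensitive, which is why a token like `IT` survives even though lowercase `it` is in NLTK's English stop-word list.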

Let us now see what percentage of all job posts are specified as part time or full time. These figures are not very accurate because many values in this column are empty, so treat them as rough estimates.

In [192]:
# get the Term column
term_data = filtered_data["Term"]
# variables to store the total number of rows, part time and full time counts
totalRows = 0
partTimeCounter = 0
fullTimeCounter = 0

# count the number of jobs that are part time and full time
for row in term_data:
    totalRows += 1
    if row in ("Full time", "Full Time", "Full-time", "Full-Time"):
        fullTimeCounter += 1
    if row in ("Part time", "Part Time", "Part-time", "Part-Time"):
        partTimeCounter += 1
    # a post that offers both terms counts towards both totals
    if row == "Part-time or Full-time":
        fullTimeCounter += 1
        partTimeCounter += 1


# print the stats
print(partTimeCounter, fullTimeCounter, totalRows)

300 6636 19001


In [193]:
# calculate part time job percentage
partTimePercentage = calculatePercentage(partTimeCounter, totalRows)
print("Percentage of part time jobs from the dataset", partTimePercentage, "%")

Percentage of part time jobs from the dataset 1.578864270301563 %


In [194]:
# calculate full time job percentage
fullTimePercentage = calculatePercentage(fullTimeCounter, totalRows)
print("Percentage of full time jobs from the dataset", fullTimePercentage, "%")

Percentage of full time jobs from the dataset 34.924477659070575 %


In [195]:
# other specified or not specified job percentage
otherJobPercentage = calculatePercentage(totalRows - partTimeCounter - fullTimeCounter, totalRows)
print("Percentage of other jobs from the dataset", otherJobPercentage, "%")

Percentage of other jobs from the dataset 63.496658070627866 %


## Conclusion

I conclude that the share of IT job posts increased by approximately 53% from 2004 to 2015, from about 16.6% to about 25.4% of all posts. This result applies to Armenia, since the data is Armenia-based: a close look at the Location column shows that almost all the job posts are located in Armenia.

However, the full-time and part-time percentages are not very accurate, because many of the values were empty. The counting worked, but there were other kinds of terms, such as "Permanent", that I did not take into consideration.
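
One way to reduce the problem of inconsistent `Term` values is to normalize each value before counting, so that spelling variants fall into one bucket and unexpected values such as "Permanent" stay visible instead of being silently ignored. A minimal sketch, assuming made-up values and a hypothetical helper `normalize_term`:

```python
import math
from collections import Counter

def normalize_term(value):
    """Map a raw Term value to a lowercase, space-separated label."""
    # Missing values show up as None or as a float NaN when read with pandas.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "unspecified"
    # Lowercase, treat hyphens as spaces, and collapse repeated whitespace.
    return " ".join(str(value).lower().replace("-", " ").split())

# Hypothetical raw values, including a missing one.
raw_terms = ["Full-time", "Full Time", "part time", "Permanent", None, "Full time"]

term_counts = Counter(normalize_term(t) for t in raw_terms)
print(term_counts)
```

Printing the full counter, rather than only two hand-picked categories, makes it easier to spot which terms were left out.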