<a href="https://colab.research.google.com/github/iRoseM/Freelancing-Trends--IT362/blob/main/IT362_Groub_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Analyzing Freelancing Trends and Sustainability**

## **1. Introduction:**
Freelancing has become a popular career path because of its flexibility and the opportunities it offers across many fields. However, questions remain about its sustainability, its income stability, and the demographics of the people who do this kind of work. Studying freelancing trends can help clarify whether freelancing is a sustainable career choice, reveal income patterns, and describe who freelancers are, for example by identifying the countries with the highest concentration of skilled freelancers.


## **2. Data Sources:**


The **primary dataset** was collected by web scraping the [freelancer.com](https://www.freelancer.com) website. Web scraping is a technique for extracting unstructured data from websites programmatically, which enables efficient large-scale data collection.

The main tool used for scraping is BeautifulSoup.

### - Printing the HTML structure to make the page easier to navigate

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# Collect and parse the first page
response = requests.get('https://www.freelancer.com/freelancers/')
# 200 means success; 429 (Too Many Requests) means the site rate-limited the request
print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())



429
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Browser Check
  </title>
  <link href="https://fonts.googleapis.com/css2?family=Roboto:wght@400;700&amp;display=swap" rel="stylesheet"/>
  <style>
   body, html {
        height: 100%;
        margin: 0;
        font-family: 'Roboto', sans-serif;
        -webkit-font-smoothing: antialiased;
        line-height: 1.5;
        display: flex;
        justify-content: center;
        align-items: center;
        text-align: center;
    }
    .container {
        width: 1200px;
        padding: 20px;
    }
    .logo img {
        width: 200px;
        height: auto;
    }
    .message {
        margin-top: 20px;
    }
    .message .title {
      font-weight: 700;
      font-size: 24px;
      margin-bottom: 40px;
    }
    .footer {
      margin-top: 50px;
      margin-bottom: 40px;
    }
  </style>
  <script>
   function onReCaptchaLoad(

#### - Data retrieval:
We extracted data by scraping the website's HTML elements, targeting specific tags to retrieve the relevant information.
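The tag-targeting idea can be illustrated on a small inline snippet before running it against the live site. This is only a sketch: the `SAMPLE_HTML` markup and the `text_or_na` helper are hypothetical, and the real page structure may differ. It uses CSS selectors through `select_one`, which match an element that carries all the listed classes regardless of their order in the `class` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one directory entry; the real page may differ.
SAMPLE_HTML = """
<div class="directory-freelancer-item-container">
  <a class="find-freelancer-username">example_user</a>
  <span class="freelancer-hourlyrate user-hourly-rate">$15 USD per hour</span>
</div>
"""


def text_or_na(parent, selector):
    """Return the stripped text of the first match, or 'N/A' when absent."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag else "N/A"


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
card = soup.select_one("div.directory-freelancer-item-container")

record = {
    "Freelancer Name": text_or_na(card, "a.find-freelancer-username"),
    # Both classes must be present, but their order does not matter.
    "Hourly Rate": text_or_na(card, "span.user-hourly-rate.freelancer-hourlyrate"),
    # This element is missing from the sample, so the fallback is used.
    "Location": text_or_na(card, "div.user-location"),
}
print(record)
```

A single helper keeps the "value or `'N/A'`" fallback in one place instead of repeating it for every field.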

In [None]:
# 1. Importing libraries
import requests  # Sends HTTP requests and retrieves web page content.
from bs4 import BeautifulSoup  # Parses HTML and extracts data from it.
import pandas as pd  # Stores and manipulates the data in DataFrames.

# 2. Set the number of directory pages to scrape
num_pages = 100
base_url = "https://www.freelancer.com/freelancers/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

data = []

# 3. Retrieve data from the HTML elements of each page
for page in range(1, num_pages + 1):
    url = f"{base_url}{page}"
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        freelancers = soup.find_all('div', class_='directory-freelancer-item-container')

        for freelancer in freelancers:
            # Extract username
            name = freelancer.find('a', class_='find-freelancer-username')
            name = name.text.strip() if name else 'N/A'

            # Extract hourly rate
            hourly_rate = freelancer.find('span', class_='user-hourly-rate freelancer-hourlyrate')
            hourly_rate = hourly_rate.text.strip() if hourly_rate else 'N/A'

            # Extract skills
            skills = [skill.text.strip() for skill in freelancer.find_all('a', style='color:black;')]
            skills = ', '.join(skills) if skills else 'N/A'

            # Extract location
            location = freelancer.find('div', class_='user-location')
            location = location.text.strip() if location else 'N/A'

            # Extract bio
            bio = freelancer.find('div', class_='bio cleanProfile')
            bio = bio.text.strip() if bio else 'N/A'

            # Extract Rating
            rating_tag = freelancer.find('span', class_='Rating Rating--labeled')
            rating = rating_tag.get('data-star_rating', 'N/A') if rating_tag else 'N/A'

            # Extract earnings score
            earnings = freelancer.find('div', class_='Earnings')
            earnings = earnings.text.strip() if earnings else 'N/A'

            # Extract Reviews
            reviews_tag = freelancer.find('a', class_='directory-freelancer-rating-mobile')
            if reviews_tag:
                reviews_text = reviews_tag.text.strip()
                reviews = reviews_text.split(' ')[0] if 'reviews' in reviews_text else 'N/A'
            else:
                reviews = 'N/A'

            data.append({
                'Freelancer Name': name,
                'Hourly Rate': hourly_rate,
                'Skills': skills,
                'Location': location,
                'Rating': rating,
                'Reviews': reviews,
                'Total Earnings': earnings,
                'Bio': bio
            })
    else:
        print(f"Failed to retrieve page {page}, status code: {response.status_code}")

# 4. Saving data into CSV file and printing sample frame
df = pd.DataFrame(data)
df.to_csv('freelancer_data.csv', index=False)

print(df.head())

Failed to retrieve page 80, status code: 429
   Freelancer Name       Hourly Rate  \
0         hirujiyu  $15 USD per hour   
1       mikehurley  $15 USD per hour   
2  weblinkbuilding   $6 USD per hour   
3       Ibrahim185  $20 USD per hour   
4        ITYPETech  $35 USD per hour   

                                              Skills    Location Rating  \
0  Android, Mobile App Development, iPhone, iPad,...       India    5.0   
1  SEO, Link Building, Google Adwords, WordPress,...       India    4.9   
2  SEO, Internet Marketing, Link Building, Market...       India    4.9   
3  Data Entry, Excel, Web Search, Data Processing...  Bangladesh    5.0   
4  Graphic Design, Logo Design, Photoshop, Brochu...       India    5.0   

  Reviews Total Earnings                                                Bio  
0     786           10.0  At DaydreamSoft Infotech LLP, we excel in tran...  
1    6958           10.0  6800 ⭐⭐⭐⭐⭐+ Reviews,\n\n2000+ satisfied client...  
2    3362           10.0  Whi
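
The run above skipped page 80 because the site answered with status 429 (Too Many Requests). One possible way to reduce such gaps is to retry a rate-limited page after a growing pause. The sketch below is an assumption, not part of the original scraper: `fetch_with_retry`, its parameters, and the delay values are hypothetical, and `fetch` stands for any callable that returns an object with a `status_code` attribute.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until the response is not 429 or attempts run out.

    `fetch` is any callable returning an object with a `status_code`
    attribute, e.g. `lambda u: requests.get(u, headers=headers)`.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff: wait 2s, 4s, 8s, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    # Still rate-limited after the last attempt; the caller decides what to do.
    return response
```

Inside the scraping loop, the call `requests.get(url, headers=headers)` could be replaced with `fetch_with_retry(lambda u: requests.get(u, headers=headers), url)`. Passing `sleep` as a parameter makes the function easy to try out without actually waiting.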

### **Dataset overview**
This section summarizes the dataset and presents basic information about it: its size, first rows, data types, and null values.

#### - Dataset Size:
Our dataset has 990 rows and 8 columns. Each row represents one freelancer profile, and the columns describe that freelancer's characteristics.

In [None]:
# number of rows and columns
df.shape

(990, 8)

#### - Dataset general information:



In [None]:
# print main dataset info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Freelancer Name  990 non-null    object
 1   Hourly Rate      990 non-null    object
 2   Skills           990 non-null    object
 3   Location         990 non-null    object
 4   Rating           990 non-null    object
 5   Reviews          990 non-null    object
 6   Total Earnings   990 non-null    object
 7   Bio              990 non-null    object
dtypes: object(8)
memory usage: 62.0+ KB
None


| Column Name     | Description | Data Type | Possible Values |
|-----------------|-------------|-----------|-----------------|
| Freelancer Name | The freelancer's username, which identifies their profile on the platform. | Object | String values representing freelancer usernames |
| Hourly Rate     | The amount a freelancer charges per hour of work. | Object | String (e.g., "$15 USD per hour", "N/A") |
| Skills          | A list of specific abilities or areas of expertise that a freelancer has (e.g., PHP, Graphic Design). | Object | Comma-separated string of skills or "N/A" |
| Location        | The freelancer's country of residence or work (e.g., India, Pakistan). | Object | String (e.g., "India", "N/A") |
| Rating          | The average rating given to the freelancer, based on client reviews and feedback. | Object | String (e.g., "4.9", "N/A") |
| Reviews         | The number of reviews received from clients, reflecting the freelancer's performance and quality of work. | Object | String (e.g., "786", "N/A") |
| Total Earnings  | The earnings score, which represents a freelancer's overall earnings from the projects and contests they have completed on the site. | Object | String (e.g., "7.9", "N/A") |
| Bio             | A brief description written by the freelancer outlining their background, experience, and professional journey. | Object | String (e.g., "Experienced software developer", "N/A") |

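**As shown in the output above**, all columns have the "object" data type. Because every value was scraped as text, each column stores Python strings, including the numeric fields: hourly rates, ratings, review counts, and earnings scores. These columns need to be converted to numeric types before any calculations can be performed on them.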
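
A minimal sketch of how the text columns could be converted to numbers, assuming the string formats seen in the sample output (for example `"$15 USD per hour"`). The `sample` frame is hypothetical illustration data, not the scraped dataset.

```python
import pandas as pd

# Hypothetical rows in the same string format as the scraped columns.
sample = pd.DataFrame({
    "Hourly Rate": ["$15 USD per hour", "$100 USD per hour", "N/A"],
    "Rating": ["5.0", "4.9", "N/A"],
    "Reviews": ["786", "6958", "N/A"],
    "Total Earnings": ["10.0", "6.5", "N/A"],
})

converted = sample.copy()

# Pull out the number that follows the dollar sign; non-matching rows become NaN.
rate = converted["Hourly Rate"].str.extract(r"\$(\d+(?:\.\d+)?)", expand=False)
converted["Hourly Rate"] = pd.to_numeric(rate, errors="coerce")

# errors="coerce" turns placeholders such as "N/A" into NaN instead of raising.
for col in ["Rating", "Reviews", "Total Earnings"]:
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
print(converted)
```

Coercing instead of raising keeps rows with missing values in the frame, so they can be counted or handled explicitly later.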

#### - Number of missing values:
As shown below, `isnull()` reports 0 nulls for every attribute, so no rows or columns need to be filled or dropped on the basis of null values. Note, however, that the scraper writes the string 'N/A' whenever an element is not found on the page, and `isnull()` does not count that placeholder as missing.

In [None]:
# Number of missing value in the dataset
df.isnull().sum()

Unnamed: 0,0
Freelancer Name,0
Hourly Rate,0
Skills,0
Location,0
Rating,0
Reviews,0
Total Earnings,0
Bio,0
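
Because the scraper falls back to the string `'N/A'` when an element is not found, a zero null count does not by itself rule out missing values. The sketch below, on hypothetical sample rows, shows one way to count such placeholders and turn them into real missing values.

```python
import pandas as pd

# Hypothetical rows; "N/A" is the fallback string written by the scraper.
sample = pd.DataFrame({
    "Freelancer Name": ["user_a", "user_b", "user_c"],
    "Location": ["India", "N/A", "Pakistan"],
    "Reviews": ["786", "N/A", "N/A"],
})

# isnull() sees "N/A" as an ordinary string, so every count is 0.
print(sample.isnull().sum())

# Count the placeholder strings per column instead.
placeholder_counts = sample.eq("N/A").sum()
print(placeholder_counts)

# Mask the placeholders so that they become real missing values.
cleaned = sample.mask(sample.eq("N/A"))
print(cleaned.isnull().sum())
```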


#### - Number of duplicates:

In [None]:
# Number of duplicated rows
print("Total number of duplicated rows: " + str(sum(df.duplicated())))

Total number of duplicated rows: 10


In [None]:
df[df['Freelancer Name'].duplicated(keep=False)]


Unnamed: 0,Freelancer Name,Hourly Rate,Skills,Location,Rating,Reviews,Total Earnings,Bio
409,zainalitariq245,$20 USD per hour,"PHP, iPhone, Android, Mobile App Development, ...",Pakistan,5.0,36,6.5,Greetings!\n\nI have considerable experience i...
410,zainalitariq245,$20 USD per hour,"PHP, iPhone, Android, Mobile App Development, ...",Pakistan,5.0,36,6.5,Greetings!\n\nI have considerable experience i...
449,OP3NSOURC3,$35 USD per hour,"Website Design, PHP, HTML, Graphic Design, Wor...",Pakistan,5.0,311,9.1,"We have been designing, developing, customizin..."
450,OP3NSOURC3,$35 USD per hour,"Website Design, PHP, HTML, Graphic Design, Wor...",Pakistan,5.0,311,9.1,"We have been designing, developing, customizin..."
509,Istehsanimtenan,$100 USD per hour,"Article Writing, Content Writing, Copywriting,...",Pakistan,5.0,7,4.2,I’m Dr. Abdul Hannan (PhD) Data Analyst and Re...
510,Istehsanimtenan,$100 USD per hour,"Article Writing, Content Writing, Copywriting,...",Pakistan,5.0,7,4.2,I’m Dr. Abdul Hannan (PhD) Data Analyst and Re...
669,BrightSolution2,$15 USD per hour,"PHP, HTML, WordPress, Website Design, MySQL",India,4.9,507,7.9,As an accomplished and highly skilled website ...
670,BrightSolution2,$15 USD per hour,"PHP, HTML, WordPress, Website Design, MySQL",India,4.9,507,7.9,As an accomplished and highly skilled website ...
689,Champian,$25 USD per hour,"Mobile App Development, Website Design, PHP, G...",India,5.0,343,8.6,Saturncube Technologies\n---------------------...
690,Champian,$25 USD per hour,"Mobile App Development, Website Design, PHP, G...",India,5.0,343,8.6,Saturncube Technologies\n---------------------...


As the output above shows, there are 10 duplicated rows: five usernames each appear twice, in consecutive rows. Because 'Freelancer Name' holds the unique username and acts as the primary key, we decided to drop the repeated rows and keep the first occurrence of each. Duplicate primary keys could otherwise undermine data integrity and the accurate identification of users.

In [None]:
df.drop_duplicates(subset=['Freelancer Name'], keep='first', inplace=True)
df[df['Freelancer Name'].duplicated(keep=False)]
sum(df.duplicated())

0

After dropping the duplicates, we recomputed the count to confirm that none remain.
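
The `keep` argument changes which rows `duplicated()` flags, which matters when reading duplicate counts. A small sketch on hypothetical data (the `sample` frame is not taken from the scraped dataset):

```python
import pandas as pd

# Hypothetical rows in which one username appears twice.
sample = pd.DataFrame({
    "Freelancer Name": ["user_a", "user_b", "user_b", "user_c"],
    "Rating": ["5.0", "4.9", "4.9", "5.0"],
})

# The default keep="first" flags only the later copy of each duplicate.
later_copies = int(sample.duplicated().sum())

# keep=False flags every row that belongs to a duplicate group.
all_in_groups = int(sample.duplicated(subset=["Freelancer Name"], keep=False).sum())

print(later_copies, all_in_groups)

# Keep the first occurrence of each username and rebuild a gap-free index.
deduped = (
    sample.drop_duplicates(subset=["Freelancer Name"], keep="first")
          .reset_index(drop=True)
)
print(deduped)
```

Resetting the index is optional; it only avoids gaps in the row labels after rows are removed.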