# Final Project - LinkedIn

## Introduction to Data Analytics 2024
### Presented by Orel Ben Naim

#### Data source: https://www.kaggle.com/killbot/linkedin-profiles-and-jobs-data
#### Our GitHub: https://github.com/orel15211/finaly

<img src="https://i.pinimg.com/originals/d3/3b/d9/d33bd9baa83a336184055c07dc8ccaa8.gif" width=700 height=700 align=left />

## I have chosen to analyze LinkedIn, an online social network designed to foster professional and business connections among its users. LinkedIn was established in 2003 and has since become a key tool for professionals in various fields. The platform allows users to create professional profiles, share resumes, showcase skills and achievements, and connect with colleagues, recruiters, and potential employers.

## In this project, I examined network data to identify the factors that contribute to effective use of the platform. LinkedIn helps millions of users worldwide by providing them with tools to build professional networks, find job opportunities, and share knowledge and insights in their fields. As a student and future engineer, it is particularly important for me to learn how to utilize LinkedIn optimally to find employment opportunities and build my career in the coming years.



<img src="https://admin.drushim.co.il/Content/Uploads/636670041546219798_84.1.jpg" width=700 height=700 align=center />

In [None]:
import pandas as pd
import numpy as np
import datetime 
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="seaborn.axisgrid")


In [None]:
url = ("https://raw.githubusercontent.com/orel15211/finaly/main/Linkedin%20CSV.csv")

In [None]:
data = pd.read_csv(url)

# Wrangling data:


## 1. Handling missing values:

In [None]:
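# Drop rows that lack a gender estimate or a company name
data = data.dropna(subset=['genderEstimate', 'companyName'])
# Label missing picture/logo values explicitly (plain assignment instead of chained inplace fillna)
data['hasPicture'] = data['hasPicture'].fillna('no picture')
data['companyHasLogo'] = data['companyHasLogo'].fillna('no logo')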

#### 1) We removed rows with missing values in "companyName" or "genderEstimate".
#### 2) We filled missing values in "hasPicture" and "companyHasLogo" with "no picture" and "no logo", respectively.
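
The two cleaning steps can be sketched on a small example. The rows below are made-up toy values, not taken from the LinkedIn CSV, and the names `toy` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the LinkedIn CSV (values are made up).
toy = pd.DataFrame({
    "genderEstimate": ["male", np.nan, "female", "male"],
    "companyName": ["Acme", "Globex", np.nan, "Initech"],
    "hasPicture": ["a.jpg", "b.jpg", np.nan, np.nan],
    "companyHasLogo": [np.nan, "logo.png", "logo.png", "logo.png"],
})

# Step 1: drop rows where either key column is missing.
cleaned = toy.dropna(subset=["genderEstimate", "companyName"])

# Step 2: fill the remaining gaps with explicit labels. Passing a dict to
# fillna fills each column with its own value and returns a new DataFrame.
cleaned = cleaned.fillna({"hasPicture": "no picture", "companyHasLogo": "no logo"})

# Total number of missing values left in the cleaned frame.
remaining_missing = int(cleaned.isnull().sum().sum())
print(cleaned)
print("missing values left:", remaining_missing)
```

Filling with an explicit label instead of dropping keeps those users in the analysis, so "no picture" and "no logo" can later be compared against the other groups.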

In [None]:
missing=data[['genderEstimate','hasPicture','companyHasLogo','companyName']].isnull().sum()
pd.DataFrame(missing)

## 2. Fix columns:

In [None]:
data[['genderEstimate','hasPicture','companyHasLogo','companyName','followersCount','ageEstimate']].info()

#### All column dtypes were already correct, so no conversion was needed.
## 

# What do you think?
### Who is more likely to use LinkedIn?
### Male or Female?


<img src="https://askthescientists.com/wp-content/uploads/2018/04/AdobeStock_62125649.png" width=700 height=700 align=center />

In [None]:
plt.figure(figsize=(10, 6))
plt.subplot(1,2,1)
plt.title("Male/Female Percentage",fontsize=25)
plotpie=data['genderEstimate'].value_counts().plot.pie(autopct='%1.1f%%',colors = ['mediumaquamarine', 'plum'],fontsize=15)
plt.legend(fontsize=15)
plt.subplot(1,2,2)
plt.title("Male/Female Count",fontsize=25)
plt.xlabel("Gender Estimate",fontsize=20)
plt.ylabel("Count",fontsize=20)
sns.countplot( x="genderEstimate",data=data , edgecolor = 'black', palette = 'PiYG_r', hue='genderEstimate')
plt.show()

## These charts show that there are more male users than female users.
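
The numbers behind a pie chart and a count plot can also be read as a table. The sketch below uses a made-up `gender` series as a stand-in for `data['genderEstimate']`; the names `gender` and `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical gender labels; the real column is data['genderEstimate'].
gender = pd.Series(["male", "male", "male", "female"], name="genderEstimate")

# Absolute counts per label (what the count plot draws).
counts = gender.value_counts()

# Relative shares in percent (what the pie chart's autopct prints).
shares = gender.value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```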
## 

## We would like to examine if there's a connection between Gender and Followers Count

In [None]:
type_df = data[["genderEstimate", "followersCount"]]
# catplot is a figure-level function and creates its own figure, so no plt.figure() call is needed.
# With kind="bar", each bar shows the mean followersCount for that gender.
sns.catplot(data=type_df, kind="bar", x="genderEstimate", y = "followersCount",height=5,aspect=1,edgecolor = 'black',palette = 'PiYG_r')
plt.title("Male/Female Followers Count",fontsize=25)
plt.xlabel("Gender",fontsize=20)
plt.ylabel("Followers Count",fontsize=20)

## Although there are significantly fewer female users than male users,
## their average followers count is close to that of male users.
## Despite their smaller number, female users reach almost the same exposure and number of followers as male users.
## This suggests that LinkedIn can be just as effective a platform for female users.
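
Since seaborn's bar plots show the mean by default, the same comparison can be checked numerically. This is a minimal sketch on made-up values; `toy` and `by_gender` are hypothetical names, and the median is added only as an extra check against outliers.

```python
import pandas as pd

# Hypothetical toy data; the real columns come from the LinkedIn CSV.
toy = pd.DataFrame({
    "genderEstimate": ["male", "male", "male", "female", "female"],
    "followersCount": [100, 200, 300, 150, 250],
})

# Group size, mean, and median followers per gender in one table.
by_gender = (
    toy.groupby("genderEstimate")["followersCount"]
    .agg(["count", "mean", "median"])
)
print(by_gender)
```

Showing the group size next to the mean makes it visible when a smaller group still has a similar average.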









# 

## We would like to examine the connection between Logo, Picture and Followers Count

<img src="https://digitalpedagogydotwordpressdotcom.files.wordpress.com/2020/02/linkedin-4763813_1920.png" width=600 height=600 align=center />

In [None]:
# Collapse raw picture values (those containing 'jpg' or 'A') into a single 'has picture' label
data.loc[data['hasPicture'].str.contains('jpg'), 'hasPicture'] = 'has picture'
data.loc[data['hasPicture'].str.contains('A'), 'hasPicture'] = 'has picture'
# Collapse raw logo values (those containing 'png', 'jpg', 'e' or 'A') into a single 'has logo' label
data.loc[data['companyHasLogo'].str.contains('png'), 'companyHasLogo'] = 'has logo'
data.loc[data['companyHasLogo'].str.contains('jpg'), 'companyHasLogo'] = 'has logo'
data.loc[data['companyHasLogo'].str.contains('e'), 'companyHasLogo'] = 'has logo'
data.loc[data['companyHasLogo'].str.contains('A'), 'companyHasLogo'] = 'has logo'

### To analyze the data, we replaced the raw values in these columns with the labels "has picture" and "has logo".
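
An alternative way to get the same two-label columns is to compare against the placeholder directly. This is a sketch on made-up values, and it assumes that every value other than "no picture" / "no logo" means the item is present, which may not hold for the real CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values; the real columns hold whatever the CSV provides.
toy = pd.DataFrame({
    "hasPicture": ["photo1.jpg", "no picture", "AAEAAQ", "no picture"],
    "companyHasLogo": ["logo.png", "no logo", "logo.jpg", "no logo"],
})

# Assumption: anything that is not the placeholder counts as "present".
toy["hasPicture"] = np.where(
    toy["hasPicture"] == "no picture", "no picture", "has picture"
)
toy["companyHasLogo"] = np.where(
    toy["companyHasLogo"] == "no logo", "no logo", "has logo"
)
print(toy)
```

Compared with substring matching, this does not depend on which letters or file extensions happen to appear in the raw values.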

In [None]:
plt.figure(figsize=(15, 4.4))
# Count the rows in each (logo, picture) group and draw one bar per group
data.groupby(['companyHasLogo','hasPicture'])['followersCount'].count().plot.bar(edgecolor = 'black',color=['mediumvioletred','deeppink', 'hotpink', 'pink'])
plt.xticks(rotation=60,fontsize=12)
plt.title("Logo/Picture effect on Followers Count",fontsize=22)
plt.xlabel("Logo vs Picture",fontsize=15)
plt.ylabel("Followers Count",fontsize=15)

## The graph suggests that users should upload both a logo and an image.
## A logo appears to have a higher impact on the number of followers than an image, and an image without a logo is less effective.
## As you can see, users without a logo and an image have a significantly lower number of followers.
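
Note that `groupby(...).count()` returns the number of rows in each group. To compare followers per profile across the four logo/picture groups, one option is to aggregate with the mean alongside the count. The sketch below uses made-up values; `toy` and `grouped` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data using the same labels as the notebook.
toy = pd.DataFrame({
    "companyHasLogo": ["has logo", "has logo", "has logo", "no logo", "no logo"],
    "hasPicture": ["has picture", "has picture", "no picture", "has picture", "no picture"],
    "followersCount": [400, 600, 300, 200, 100],
})

# "count" is the number of profiles in each group;
# "mean" is the average followers per profile in that group.
grouped = (
    toy.groupby(["companyHasLogo", "hasPicture"])["followersCount"]
    .agg(["count", "mean"])
)
print(grouped)
```

Plotting `grouped["mean"]` with `.plot.bar()` would then show average followers per group instead of group sizes.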

In [None]:
plt.figure(figsize=(10, 5))
plt.subplot(1,2,1)
plt.title("Logo Percentage",fontsize=30)
plotpie=data['companyHasLogo'].value_counts().plot.pie(autopct='%1.2f%%',colors = ['orchid', 'yellow'],fontsize=15)
plt.legend(fontsize=15)
plt.subplot(1,2,2)
plt.title("Pictures Percentage",fontsize=30)
plotpie=data['hasPicture'].value_counts().plot.pie(autopct='%1.2f%%',colors = ['orchid', 'turquoise'],fontsize=15)
plt.legend(fontsize=15)
plt.show()

## As you can see, most LinkedIn users have a logo and a picture.
## 

## I would like to examine the connection between Age and Followers Count

<img src="https://thumbs.dreamstime.com/b/vector-growing-up-baby-becoming-adolescent-mature-man-elderly-disabled-guy-age-evolution-stages-different-162321909.jpg" width=900 height=900 align=left />

In [None]:
age = data['ageEstimate']
plt.figure(figsize=(10, 5))
# Ten-year bin edges: 0, 10, ..., 90
bins = range(0, 100, 10)
# One color per bin, sampled from the tab20 colormap
colors = [plt.cm.tab20(i/len(bins)) for i in range(len(bins))]
# Draw each ten-year bin as its own histogram so it gets its own color
for i in range(len(bins) - 1):
    plt.hist(age, bins=[bins[i], bins[i+1]], edgecolor='black', color=colors[i])
plt.xticks(range(0, 100, 10))
plt.title("Followers Count by Age", fontsize=25)
plt.xlabel('Age Estimate', fontsize=20)
plt.ylabel('Followers Count', fontsize=20)

## Based on the above graph, users between the ages of 30 and 50 have the highest number of followers.
## I assume that the high number of followers for this age group is attributed to their experience and seniority.
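
A histogram of `ageEstimate` shows how many users fall into each age bin. To relate age to followers directly, one possible approach is to bin the ages with `pd.cut` and average `followersCount` per bin. This is a minimal sketch on made-up values; `toy`, `ageGroup`, and `by_age` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data; the real columns come from the LinkedIn CSV.
toy = pd.DataFrame({
    "ageEstimate": [22, 27, 34, 38, 45, 52],
    "followersCount": [100, 300, 500, 700, 600, 200],
})

# Ten-year bins [20, 30), [30, 40), [40, 50), [50, 60).
bin_edges = list(range(20, 70, 10))
toy["ageGroup"] = pd.cut(toy["ageEstimate"], bins=bin_edges, right=False)

# Number of users and average followers in each age group.
by_age = (
    toy.groupby("ageGroup", observed=True)["followersCount"]
    .agg(["count", "mean"])
)
print(by_age)
```

Keeping the count next to the mean shows whether a high average rests on many users or only a few.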

In [None]:
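# For each distinct followers count, take the highest age; groupby sorts the keys,
# so tail() shows the five largest followers counts
maxage=data.groupby('followersCount')[['ageEstimate']].max()
maxage.tail()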

## As you can see, the LinkedIn users with the highest number of followers are in the 30-50 age group.
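
Another way to list the users with the most followers is `DataFrame.nlargest`, which keeps one row per user instead of one row per distinct followers count. The sketch below uses made-up values; `toy` and `top` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data; the real columns come from the LinkedIn CSV.
toy = pd.DataFrame({
    "ageEstimate": [25, 41, 33, 58, 47],
    "followersCount": [120, 900, 450, 300, 800],
})

# The three users with the most followers, highest first.
top = toy.nlargest(3, "followersCount")[["followersCount", "ageEstimate"]]
print(top)
```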

## 

## In conclusion, in my research I wanted to examine the dependency between the following variables: gender, logo & image, age and followers count.

## Our dependent variable is the followers count, and our independent variables are gender, logo & image and age.

<img src="https://allstarsdigital.in/wp-content/uploads/2020/09/linkedin_Ads.png" width=800 height=800 align=center />


## I found the following about using LinkedIn effectively:
## 1. Although the number of females using the network is lower than the number of males, the average followers count was almost the same for both.
## 2. Using an image is important for user exposure. However, including a logo has a greater impact.
## 3. The users with the highest number of followers are in the 30-50 age group.

In [None]:
maxage=data.groupby('followersCount')[['genderEstimate','hasPicture','companyHasLogo','ageEstimate']].max()
maxage.tail()

## As you can see, the findings in the table match my expectations.


<img src="https://www.edigitalagency.com.au/wp-content/uploads/linkedin-logo-gif-funny-man-suitcase.gif" width=600 height=600 align=center />