# Capgemini Reviews Dataset

### Introduction

In this lesson, we'll explore a dataset of employee reviews of Capgemini.

### Loading the data

Let's get started by loading our Capgemini data.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/python-fundamentals-jigsaw/capgemini-review/main/reviews.csv"
df = pd.read_csv(url)

In [2]:
reviews = df.to_dict('records')

### Exploring the data

Let's begin by exploring the data.  Remember that our first step can be to explore the grain of the data, and then we can move on to getting an overview of the data.

Begin by selecting the first review and looking at its keys.

In [3]:
first_review = reviews[0]
first_review.keys()

# dict_keys(['Title', 'Place', 'Job_type', 'Department', 'Date', 'Overall_rating', 'work_life_balance', 'skill_development', 'salary_and_benefits', 'job_security',
# 'career_growth', 'work_satisfaction', 'Likes', 'Dislikes'])

And from there, look at the corresponding values.

In [4]:
first_review.values()

# dict_values(['Senior Consultant', 'Pune', 'Full Time', 'General Insurance Department',
# '8 Sep 2023', 4.0, 4.0, 3.0, 3.0, 4.0, 4.0, 4.0, 'Deserved candidates are promoted promptly.\nUnbiased in providing opportunities to employees, regardless of their gender or any other thing', 'With designation promotions good salary increment is also required'])
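
One way to check the grain is to confirm that every record has the same set of keys. The sketch below is only an illustration: `sample_reviews` is a small hypothetical stand-in for the real `reviews` list, and the second record is invented.

```python
# Hypothetical sample records standing in for the `reviews` list.
sample_reviews = [
    {'Title': 'Senior Consultant', 'Place': 'Pune', 'Overall_rating': 4.0},
    {'Title': 'Analyst', 'Place': 'Mumbai', 'Overall_rating': 3.0},
]

# Collect the distinct sets of keys; a single distinct set means
# every record has the same shape.
key_sets = {frozenset(review.keys()) for review in sample_reviews}
same_keys = len(key_sets) == 1
print(same_keys)  # True
```

Running the same check against `reviews` should show whether each dictionary really represents one review with the same attributes.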

### Coercing the data

Now before we get started, let's change the data a bit.  

Create a new list of dictionaries -- it should look just like the old list of reviews, but each review should also have keys of `day`, `month`, and `year`, with the values coming from that review's `Date` value.  The `Date` key-value pair can be excluded.  Day and year should both be integers.

If the date is nan, there should still be keys for day, month, and year -- but the values can be None.
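
As a minimal sketch of the splitting step, the helper below handles one date value at a time. The name `split_date` is hypothetical (it is not part of the lesson), and the sketch assumes a missing date arrives as a float nan rather than a string, which is how pandas normally represents missing values.

```python
def split_date(date):
    """Return (day, month, year); all three are None when date is not a string."""
    if not isinstance(date, str):
        # A missing date is a float nan, so there is nothing to split.
        return None, None, None
    day, month, year = date.split(' ')
    return int(day), month, int(year)

print(split_date('8 Sep 2023'))    # (8, 'Sep', 2023)
print(split_date(float('nan')))    # (None, None, None)
```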

In [9]:
# Try the logic on the first five reviews before wrapping it in a function
new_list = []
for review in reviews[:5]:
  review_copy = review.copy()
  if not isinstance(review_copy['Date'], str):
    # A missing date is read in as nan (a float), so there is nothing to split
    day, month, year = None, None, None
  else:
    day, month, year = review_copy['Date'].split(' ')
    day, year = int(day), int(year)
  review_copy['day'] = day
  review_copy['month'] = month
  review_copy['year'] = year
  new_list.append(review_copy)


* Now wrap the code in a function.  The function can be called `build_dated_reviews`; it takes an argument of `reviews` and returns a new list of dictionaries.

In [17]:
def build_dated_reviews(reviews):
  reviews_list = []
  for review in reviews:
    review_copy = review.copy()
    if not isinstance(review_copy['Date'], str):
      # Missing dates still get the three keys, with None values
      review_copy['day'] = None
      review_copy['month'] = None
      review_copy['year'] = None
    else:
      day, month, year = review_copy['Date'].split(' ')
      review_copy['day'] = int(day)
      review_copy['month'] = month
      review_copy['year'] = int(year)
    reviews_list.append(review_copy)

  return reviews_list

In [18]:
dated_reviews = build_dated_reviews(reviews)
len(dated_reviews)

26993

In [19]:
print(dated_reviews[0])

# {'Title': 'Senior Consultant', 'Place': 'Pune',
# 'Job_type': 'Full Time', 'Department': 'General Insurance Department',
# 'Date': '8 Sep 2023', 'Overall_rating': 4.0,
# 'work_life_balance': 4.0, 'skill_development': 3.0, 'salary_and_benefits': 3.0, 'job_security': 4.0, 'career_growth': 4.0, 'work_satisfaction': 4.0, 'Likes': 'Deserved candidates are promoted promptly.\nUnbiased in providing opportunities to employees, regardless of their gender or any other thing', 'Dislikes': 'With designation promotions good salary increment is also required', 'day': 8, 'month': 'Sep', 'year': 2023}

{'Title': 'Senior Consultant', 'Place': 'Pune', 'Job_type': 'Full Time', 'Department': 'General Insurance Department', 'Date': '8 Sep 2023', 'Overall_rating': 4.0, 'work_life_balance': 4.0, 'skill_development': 3.0, 'salary_and_benefits': 3.0, 'job_security': 4.0, 'career_growth': 4.0, 'work_satisfaction': 4.0, 'Likes': 'Deserved candidates are promoted promptly.\nUnbiased in providing opportunities to employees, regardless of their gender or any other thing', 'Dislikes': 'With designation promotions good salary increment is also required', 'day': 8, 'month': 'Sep', 'year': 2023}


Now let's look at a single review -- to see how these new keys look.

In [20]:
first_dated_review = dated_reviews[0]
first_dated_review['day'], first_dated_review['month'], first_dated_review['year']
# (8, 'Sep', 2023)

(8, 'Sep', 2023)
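
The lesson splits the date string by hand. An alternative, shown here only as a sketch and not as the lesson's method, is the standard library's `datetime.strptime`, which also gives the month as an integer. Note that `%b` matches abbreviated month names in the current locale, so this assumes an English locale.

```python
from datetime import datetime

# '%d %b %Y' means: day of month, abbreviated month name, four-digit year.
parsed = datetime.strptime('8 Sep 2023', '%d %b %Y')
print(parsed.day, parsed.month, parsed.year)  # 8 9 2023
```

Having the month as a number can make it easier to sort reviews chronologically later on.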

* And now let's write a function that, given a review, returns only the specified attributes of that review.

In [25]:
def build_selected_review(review, attrs):
    return {k:v for k, v in review.items() if k in attrs}
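
To see how the comprehension behaves, here is a small sketch on a hypothetical record. The `sample` dictionary and the `'Salary'` attribute are made up for illustration; the point is that a requested attribute that is not in the review is simply skipped rather than raising an error.

```python
def build_selected_review(review, attrs):
    # Keep only the key-value pairs whose key is one of the requested attributes.
    return {k: v for k, v in review.items() if k in attrs}

sample = {'Title': 'Senior Consultant', 'Place': 'Pune', 'year': 2023}
selected = build_selected_review(sample, ['Title', 'year', 'Salary'])
print(selected)  # {'Title': 'Senior Consultant', 'year': 2023}
```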

In [26]:
selected_attrs = ['Title', 'Job_type', 'Department', 'Overall_rating',
                  'day', 'month', 'year']

first_selected = build_selected_review(first_dated_review, selected_attrs)

print(first_selected)

# {'Title': 'Senior Consultant', 'Job_type': 'Full Time',
#  'Department': 'General Insurance Department',
#  'Overall_rating': 4.0, 'day': 8, 'month': 'Sep', 'year': 2023}

{'Title': 'Senior Consultant', 'Job_type': 'Full Time', 'Department': 'General Insurance Department', 'Overall_rating': 4.0, 'day': 8, 'month': 'Sep', 'year': 2023}


And now write a function called `build_selected_reviews` that only returns the specified attributes for each of the reviews.

In [27]:
def build_selected_reviews(reviews, attrs):
    return [build_selected_review(review, attrs) for review in reviews]

Ok, now let's use it to keep just the job title, job type, department, overall rating, and date parts of each review.

In [28]:
selected_attrs = ['Title', 'Job_type', 'Department', 'Overall_rating',
                  'day', 'month', 'year']

selected_reviews = build_selected_reviews(dated_reviews, selected_attrs)


In [29]:
selected_reviews[0]
# {'Title': 'Senior Consultant',
#  'Job_type': 'Full Time',
#  'Department': 'General Insurance Department',
#  'Overall_rating': 4.0,
#  'day': 8,
#  'month': 'Sep',
#  'year': 2023}


{'Title': 'Senior Consultant',
 'Job_type': 'Full Time',
 'Department': 'General Insurance Department',
 'Overall_rating': 4.0,
 'day': 8,
 'month': 'Sep',
 'year': 2023}

### Exploring the data

Now we can explore some of the data.  The first step is to get an overview of the data.

Begin by finding the number of unique titles across the `selected_reviews`.

In [31]:
len(set(review['Title'] for review in selected_reviews))
# 4241

4241

Ok, that's a lot.

Now write a function that, given a year, finds only the `selected_reviews` written in that year.

In [32]:
selected_reviews[0]

{'Title': 'Senior Consultant',
 'Job_type': 'Full Time',
 'Department': 'General Insurance Department',
 'Overall_rating': 4.0,
 'day': 8,
 'month': 'Sep',
 'year': 2023}

In [34]:
def reviews_for_year(reviews, year):
    return [review for review in reviews if review['year'] == year]

In [35]:
selected_reviews_2023 = reviews_for_year(selected_reviews, 2023)
len(selected_reviews_2023)

# 6203

6203

Now *use the function above* to create a dictionary that has keys of the years 2017 through 2023, and values of the number of reviews in each year.

In [39]:
years = [2017, 2018, 2019, 2020, 2021, 2022, 2023]
def reviews_per_year(reviews, years):
  review_count = {}
  for year in years:
    reviews_in_year = reviews_for_year(reviews, year)
    review_count[year] = len(reviews_in_year)
  return review_count

reviews_per_year(dated_reviews, years)

# {2017: 491, 2018: 4065,
# 2019: 3103, 2020: 1163, 2021: 2804, 2022: 8086, 2023: 6203}

{2017: 491,
 2018: 4065,
 2019: 3103,
 2020: 1163,
 2021: 2804,
 2022: 8086,
 2023: 6203}
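
For comparison, the same kind of tally can be sketched with `collections.Counter` from the standard library. This is an alternative to the exercise's approach, not a replacement for it, and `sample_reviews` is a small hypothetical list rather than the real data.

```python
from collections import Counter

# Hypothetical records; the last one mimics a review with a missing date.
sample_reviews = [{'year': 2022}, {'year': 2023}, {'year': 2023}, {'year': None}]

# Count each year once per review, skipping reviews without a year.
counts = Counter(review['year'] for review in sample_reviews
                 if review['year'] is not None)
print(dict(counts))   # {2022: 1, 2023: 2}
print(counts[2017])   # 0
```

A `Counter` makes a single pass over the reviews, whereas calling `reviews_for_year` once per year scans the whole list each time.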

Next create a dictionary of the average `Overall_rating` per year -- this can be referred to as `hist` (for histogram).

In [58]:
import math

scores = {}
for year in years:
  # Keep only this year's ratings, skipping nan values so the average is defined
  ratings = [review['Overall_rating'] for review in dated_reviews
             if review['year'] == year and not math.isnan(review['Overall_rating'])]
  avg_score = sum(ratings) / len(ratings)
  scores[year] = avg_score
scores

In [45]:
def overall_rating_per_year(reviews, years):
  scores = {}
  for year in years:
    # nan ratings are not filtered out here, so any year containing one averages to nan
    ratings = [review['Overall_rating'] for review in reviews if review['year'] == year]
    avg_score = sum(ratings) / len(ratings)
    scores[year] = avg_score
  return scores


In [46]:
hist = overall_rating_per_year(dated_reviews, years)
hist

# {2017: 3.35,
#  2018: 3.16,
#  2019: 3.48,
#  2020: 3.75,
#  2021: 3.84,
#  2022: 3.98,
#  2023: 3.78}

{2017: nan,
 2018: nan,
 2019: nan,
 2020: 3.748925193465176,
 2021: 3.835592011412268,
 2022: 3.9802127133316842,
 2023: 3.780428824762212}

> Hint: If you are getting nan values, it is likely because some of the reviews have a nan `Overall_rating`.  You can exclude these records by adding a condition like the following to your code.

```python
import math
# Keep only the ratings that are not nan
ratings = [review['Overall_rating'] for review in reviews if not math.isnan(review['Overall_rating'])]
```
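
Putting the hint together with the per-year loop, a minimal sketch might look like the following. The function name `average_rating_per_year` and the `sample` records are hypothetical, and returning None for a year with no usable ratings is one possible choice, not something the lesson specifies.

```python
import math

def average_rating_per_year(reviews, years):
    averages = {}
    for year in years:
        # Collect this year's ratings, leaving out any nan values.
        ratings = [r['Overall_rating'] for r in reviews
                   if r['year'] == year and not math.isnan(r['Overall_rating'])]
        # Avoid dividing by zero when a year has no usable ratings.
        averages[year] = sum(ratings) / len(ratings) if ratings else None
    return averages

sample = [
    {'year': 2022, 'Overall_rating': 4.0},
    {'year': 2022, 'Overall_rating': float('nan')},
    {'year': 2023, 'Overall_rating': 3.0},
    {'year': 2023, 'Overall_rating': 5.0},
]
print(average_rating_per_year(sample, [2021, 2022, 2023]))
# {2021: None, 2022: 4.0, 2023: 4.0}
```

A single nan in a list makes `sum` return nan, which is why one unrated review is enough to turn a whole year's average into nan.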