Skip to content

! This assignment is designed to test the knowledge of a wide range of concepts and SQL design techniques discussed throughout the course.

License

Notifications You must be signed in to change notification settings

lmlimasd/Final_exercise_SQL_course_2021

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 

Repository files navigation

Final Project Course

SQL for Data Science - Coursera 2021 - University of California, Davis

As the final exercise in the course SQL for Data Science. This assignment is designed to test the knowledge of a wide range of concepts and SQL design techniques discussed throughout the course. Using a dataset from a US-based organization called Yelp Open Dataset , which provides a platform for users to provide reviews and rate their interactions with a variety of organizations – businesses, restaurants, health clubs, hospitals, local governmental offices, charitable organizations, etc. Yelp has made a portion of this data available for personal, educational, and academic purposes. Check out the Yelp Dataset ER Diagram and instructions below for more detailed information on the dataset and how was completed the assignment.

Note: For most information about the Dataset visit:

Yelp Dataset ER Diagram

The entity relationship (ER) diagram below, should help familiarize you with the design of the Yelp Dataset provided for this peer review activity.

Markdown Monster icon

Note: Primary Keys are denoted in the ER-Diagram with a yellow key icon.

you are asked a series of questions that will help you profile and understand the data just like a data scientist would. For this first part of the assignment, you will be assessed both on the correctness of your findings, as well as the code you used to arrive at your answer. You will be graded on how easy your code is to read, so remember to use proper formatting and comments where necessary.

1. Profile the data by finding the total number of records for each of the tables below:

Answer: The dataset have 11 tables with 10.000 rows every each.

-- Code Used --
+ SELECT COUNT(*) 
+ FROM "name of the table"

2. Find the total distinct records by either the foreign key or primary key for each table. If two foreign keys are listed in the table, please specify which foreign key:

Answer:

  • Business = (PK), id= 10000
  • Hours = (FK), business_id= 1562.
  • Category = (FK), business_id= 2643.
  • Attribute = (FK), business_id= 1115.
  • Review = (PK), id=10000, (FK) business_id= 8090, (FK)User_id= 9581.
  • Checkin = (FK), business_id= 493.
  • Photo = (PK), id=10000, (FK)business_id= 6493.
  • Tip = (FK), User_id= 537, (FK)business_id= 3979.
  • User = (PK),id=10000.
  • Friend = (FK), User_id= 11, (FK)
  • Elite_years = (FK), User_id= 2780.
-- Code Used --
+ SELECT count(DISTINCT /*Primary or Foreign Key*/)
+ FROM -- Table

3. Are there any columns with null values in the Users table? Indicate "yes," or "no."

Answer: NO

-- Code Used --
+ SELECT COUNT(*)
+ FROM User 
+ WHERE id IS NULL 
+      OR name IS NULL
+      OR review_count IS NULL
+      OR yelping_since IS NULL
+      OR useful IS NULL
+      OR funny IS NULL
+      OR cool IS NULL
+      OR fans IS NULL
+      OR average_stars IS NULL
+      OR compliment_hot IS NULL
+      OR compliment_more IS NULL
+      OR compliment_profile IS NULL
+      OR compliment_cute IS NULL
+      OR compliment_list IS NULL
+      OR compliment_note IS NULL
+      OR compliment_plain IS NULL
+      OR compliment_cool IS NULL
+      OR compliment_funny IS NULL
+      OR compliment_writer IS NULL
+      OR compliment_photos IS NUL

4. For each table and column listed below, display the smallest (minimum), largest (maximum), and average (mean) value for the following fields:

Answer:

  • Table: Review, Column: Stars
  • min: 1.0 max: 5.0 avg: 3.7082

  • Table: Business, Column: Stars
  • min: 1.0 max: 5.0 avg: 3.6549

  • Table: Tip, Column: Likes
  • min: 0 max: 2 avg: 0.0144

  • Table: Checkin, Column: Count
  • min: 1 max: 53 avg: 1.9414

  • Table: User, Column: Review_count
  • min: 0 max: 2000 avg: 24.2995

-- Code Used --
+SELECT MIN(/*Name of the column*/) AS "MIN"
+      ,MAX(/*Name of the column*/) AS "MAX"
+      ,AVG(/*Name of the column*/) AS "AVG"
+FROM /*Name of the Table*/

List the cities with the most reviews in descending order:

Answer:

City All_reviews_by_city
Las Vegas
Phoenix
Toronto
Scottsdale
Charlotte
Henderson
Tempe
Pittsburgh
Montréal
Chandler
Mesa
Gilgert
Cleveland
Madison
Glendale
Mississauga
Edinburgh
Peoria
North Las Vegas
Markham
Champaign
Stuttgart
Surprise
Lakewood
Goodyear
82854
34503
24113
20614
12523
10871
10504
9798
9448
8112
6875
6380
5593
5265
4406
3814
2792
2624
2438
2352
2029
1849
1520
1465
1155
-- Code Used --
+ SELECT city
+      ,SUM(review_count) AS All_reviews_by_city
+ FROM business
+ GROUP BY city
+ ORDER BY all_reviews_by_city DESC

6. Find the distribution of star ratings to the business in the following cities:

Avon:

Answer:

Star rating count
1.5
2.5
3.5
4.0
4.5
5.0
10
6
88
21
31
3
-- Code Used --
+ SELECT stars AS "Star rating"
+      ,sum(review_count) AS "count"
+ FROM business
+ WHERE city == "Avon"
+ GROUP BY stars

7. Find the top 3 users based on their total number of reviews

Answer:

name total number of reviews
Gerald
Sara
Yuri
2000
1629
1339
-- Code Used --
+ SELECT name
+      ,sum(review_count) AS "total number of reviews"
+ FROM user
+ GROUP BY id
+ ORDER BY sum(review_count) DESC
+ LIMIT 3

8. Does posing more reviews correlate with more fans?

Answer:

Tables 1 and 2 compare the number of reviews vs the number of fans. It is observed that only the user Gerald has a considerable number of reviews and fans, while the rest of the user varies his position in the top. So it can be concluded that there is NO CORRELATION BETWEEN a greater number of reviews and a greater number of fans.

Table 1. Total_ reviews vs Total_fans. Order by number of reviews in form descendent.

name total number of reviews total number of fans
Gerald
Sara
Yuri
.Hon
William
2000
1629
1339
1246
1215
253
50
76
101
126

Table 2. Total_ reviews vs Total_fans. Order by number of fans in form descendent.

name total number of reviews total number of fans
Amy
Mimi
Harald
Gerald
Christine
609
968
1153
2000
930
503
497
311
253
173
-- Code Used --
+ SELECT name
+      ,sum(review_count) AS "total number of reviews"
+      ,sum(fans) AS "total number of fans"
+ FROM user
+ GROUP BY id
+ ORDER BY sum(review_count) DESC -- table 1
+ ORDER BY sum(fans) DESC -- table 2

8. Does posing more reviews correlate with more fans?

Answer:

YES, there are 1780 reviews with the word "love" and only 232 reviews with the word "hate".

-- Code Used --
+ SELECT (SELECT count(id)
+        From review
+        WHERE text LIKE "%love%") AS "love"
+      ,(SELECT count(id)
+        From review
+        WHERE text LIKE "%hate%") AS "hate"

8. Find the top 10 users with the most fans:

Answer:

YES, there are 1780 reviews with the word "love" and only 232 reviews with the word "hate".

name number of fans
Amy
Mimi
Harald
Gerald
Christine
Lisa
Cat
William
Fran
Lissa
503
497
311
253
173
159
133
126
124
120
-- Code Used --
+	SELECT name
+      ,SUM(fans) AS "number of fans"
+ FROM user
+ GROUP BY Id
+ ORDER BY fans DESC
+ LIMIT 10

8. Find the top 10 users with the most fans:

Answer:

YES, there are 1780 reviews with the word "love" and only 232 reviews with the word "hate".

About

! This assignment is designed to test the knowledge of a wide range of concepts and SQL design techniques discussed throughout the course.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published