# Week 8 Lab - Visualizations

<img align="right" style="padding-right:10px;" src="figures_wk8/tableau_sample.png" width=350><br>
This week's assignment will focus on using Tableau to produce insights and visualizations for a dataset of your choice.
## Data: 
Find an "interesting" data set to work with. UCI Machine Learning Archive and data.gov are always good places to start. You will need two datasets in total.

## Task 1:
For one of your chosen datasets, complete the following:
1. Create at least 3 descriptive graphics to communicate some aspect of your data.
2. Pay attention to the following:
   - Use of colors
   - Labeling of your axes
   - Descriptive title for your charts  
3. Make sure you use the following at least once within your 3 graphics
    - Change the Tableau default data display type (Think back to our Date field being aggregated by year)
    - Measure Values
    - Calculated field
4. Create at least one dashboard
5. Create a Story

## Task 2:
For your second chosen dataset, complete the following:
1. Create at least one graphic that was not demonstrated in the Lecture FTE.
   Take a look at this [Tableau Tutorial](https://www.tutorialspoint.com/tableau/tableau_dashboard.htm) page for additional graphic types.
2. Create a dashboard with your graphic(s) from step 1.


**Important:** Make sure your graphics are fairly self-explanatory. If you need to provide additional explanations of the information you are conveying to me, add this information to your Dashboard or Story page.

## Deliverables:

Upload your Tableau workbooks to WorldClass. You can choose to do both Tasks in one workbook or in separate workbooks.

# I. Introduction

In this assignment, I was tasked with finding 2 datasets and using Tableau to find relationships in the data and tell a story. I found that I needed much more time to learn Tableau in order to complete this assignment properly. Working with brand-new software and 2 brand-new datasets proved to be more work than anticipated! This was a very challenging yet interesting assignment, as I'm still very new to Tableau and doing my best to learn it.

# II. Methods, III. Code, and IV. Analysis of Results

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
sns.set()


In [2]:
df1 = pd.read_csv("data-scientist-salaries/data_cleaned_2021.csv")
df1.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,tensor,hadoop,tableau,bi,flink,mongo,google_an,job_title_sim,seniority_by_title,Degree
0,0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 - 1000,1973,...,0,0,1,1,0,0,0,data scientist,na,M
1,1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+,1984,...,0,0,0,0,0,0,0,data scientist,na,M
2,2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 - 1000,2010,...,0,0,0,0,0,0,0,data scientist,na,M
3,3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 - 5000,1965,...,0,0,0,0,0,0,0,data scientist,na,na
4,4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 - 200,1998,...,0,0,0,0,0,0,0,data scientist,na,na


In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 742 entries, 0 to 741
Data columns (total 42 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   index               742 non-null    int64  
 1   Job Title           742 non-null    object 
 2   Salary Estimate     742 non-null    object 
 3   Job Description     742 non-null    object 
 4   Rating              742 non-null    float64
 5   Company Name        742 non-null    object 
 6   Location            742 non-null    object 
 7   Headquarters        742 non-null    object 
 8   Size                742 non-null    object 
 9   Founded             742 non-null    int64  
 10  Type of ownership   742 non-null    object 
 11  Industry            742 non-null    object 
 12  Sector              742 non-null    object 
 13  Revenue             742 non-null    object 
 14  Competitors         742 non-null    object 
 15  Hourly              742 non-null    int64  
 16  Employer

In [4]:
df1.shape

(742, 42)

I plan to use the following columns for my data analysis and charts:
* Job Title
* Location
* Size
* Founded
* Industry
* Lower Salary
* Upper Salary
* Avg Salary(K)
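
The column list above can be turned into a smaller frame before charting. This is a minimal sketch, not the method used in this notebook: `select_chart_columns` is a hypothetical helper, the sample row is illustrative rather than taken from the dataset, and the last column is spelled `Avg Salary(K)` because that is the name `df1.isnull().sum()` reports.

```python
import pandas as pd

# Columns from the list above. "Avg Salary(K)" is the exact column name
# shown in the df1.isnull().sum() output.
CHART_COLUMNS = [
    "Job Title", "Location", "Size", "Founded", "Industry",
    "Lower Salary", "Upper Salary", "Avg Salary(K)",
]


def select_chart_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df limited to the chart columns (hypothetical helper)."""
    missing = [c for c in CHART_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df[CHART_COLUMNS].copy()


# Illustrative one-row frame (made-up values) with one extra column,
# standing in for df1 so the sketch runs on its own.
sample = pd.DataFrame({
    "Job Title": ["Data Scientist"],
    "Location": ["New York, NY"],
    "Size": ["501 - 1000"],
    "Founded": [2010],
    "Industry": ["Internet"],
    "Lower Salary": [80],
    "Upper Salary": [90],
    "Avg Salary(K)": [85.0],
    "Rating": [4.8],
})

subset = select_chart_columns(sample)
print(subset.columns.tolist())
```

In the notebook itself, the same call would be `select_chart_columns(df1)`.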

In [15]:
pd.set_option("display.max_rows", None)

In [16]:
df1.isnull().sum()

index                 0
Job Title             0
Salary Estimate       0
Job Description       0
Rating                0
Company Name          0
Location              0
Headquarters          0
Size                  0
Founded               0
Type of ownership     0
Industry              0
Sector                0
Revenue               0
Competitors           0
Hourly                0
Employer provided     0
Lower Salary          0
Upper Salary          0
Avg Salary(K)         0
company_txt           0
Job Location          0
Age                   0
Python                0
spark                 0
aws                   0
excel                 0
sql                   0
sas                   0
keras                 0
pytorch               0
scikit                0
tensor                0
hadoop                0
tableau               0
bi                    0
flink                 0
mongo                 0
google_an             0
job_title_sim         0
seniority_by_title    0
Degree          

In [17]:
df1['Job Title'].value_counts()

Data Scientist                                                                                        131
Data Engineer                                                                                          53
Senior Data Scientist                                                                                  34
Data Analyst                                                                                           15
Senior Data Engineer                                                                                   14
Senior Data Analyst                                                                                    12
Lead Data Scientist                                                                                     8
Marketing Data Analyst                                                                                  6
Sr. Data Engineer                                                                                       6
Machine Learning Engineer                     

In [18]:
df1['Location'].value_counts()

New York, NY                         55
San Francisco, CA                    49
Cambridge, MA                        47
Chicago, IL                          32
Boston, MA                           23
San Jose, CA                         13
Pittsburgh, PA                       12
Washington, DC                       11
Rockville, MD                        11
Winston-Salem, NC                    10
Richland, WA                         10
Herndon, VA                          10
Indianapolis, IN                      9
San Diego, CA                         9
Mountain View, CA                     8
Austin, TX                            8
South San Francisco, CA               8
Rochester, NY                         7
Palo Alto, CA                         7
Salt Lake City, UT                    6
Huntsville, AL                        6
Marlborough, MA                       6
Phoenix, AZ                           6
Charlotte, NC                         6
Chantilly, VA                         6


In [19]:
df1['Size'].value_counts()

1001 - 5000      150
501 - 1000       134
10000+           130
201 - 500        117
51 - 200          94
5001 - 10000      76
1 - 50            31
unknown           10
Name: Size, dtype: int64

In [20]:
df1['Founded'].value_counts()

-1       50
 2010    32
 2008    31
 1996    27
 2006    24
 2012    21
 2011    19
 1958    18
 2007    18
 1984    18
 2002    18
 2015    16
 2013    15
 1875    14
 1997    14
 1851    14
 1781    14
 2014    13
 1965    12
 2017    12
 1999    12
 2005    10
 1912    10
 2003    10
 2000    10
 1935    10
 1961     9
 1913     9
 1982     9
 1981     9
 1977     8
 1995     8
 1939     8
 1989     8
 1969     8
 1968     8
 1976     8
 1849     7
 1988     7
 1992     7
 1948     6
 2004     6
 1986     6
 1993     6
 2009     6
 1870     6
 1967     5
 1966     5
 2016     5
 1973     5
 1852     5
 1964     4
 1830     4
 1991     4
 1994     4
 1925     4
 1915     4
 1947     3
 1970     3
 1943     3
 1922     3
 1972     3
 2001     3
 1978     3
 1863     3
 1885     3
 1937     3
 1990     3
 1998     3
 1987     2
 1974     2
 1952     2
 1856     2
 1983     2
 1962     2
 1980     2
 1954     2
 1975     2
 1951     2
 2019     2
 1846     2
 1928     2
 1914     1
 181

In [21]:
df1['Industry'].value_counts()

Biotech & Pharmaceuticals                   112
Insurance Carriers                           63
Computer Hardware & Software                 59
IT Services                                  50
Health Care Services & Hospitals             49
Enterprise Software & Network Solutions      42
Internet                                     29
Consulting                                   29
Aerospace & Defense                          25
Advertising & Marketing                      25
Consumer Products Manufacturing              20
Research & Development                       19
Colleges & Universities                      16
Energy                                       14
Banks & Credit Unions                        12
Federal Agencies                             11
-1                                           10
Staffing & Outsourcing                       10
Travel Agencies                               8
Lending                                       8
Food & Beverage Manufacturing           
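
Although `df1.isnull().sum()` reports no missing values, the value counts above show placeholder entries: `-1` appears 50 times in `Founded` and 10 times in `Industry`, and `unknown` appears 10 times in `Size`. A minimal sketch of marking those as missing follows. It assumes the `Industry` placeholder is stored as the string `"-1"` (not confirmed by the output), uses a hypothetical helper name, and runs on illustrative rows rather than the real data.

```python
import pandas as pd

# Illustrative rows only; the placeholder values mirror the ones seen
# in the value_counts() outputs above.
sample = pd.DataFrame({
    "Founded": [1973, -1, 2010],
    "Industry": ["Insurance Carriers", "-1", "Internet"],
    "Size": ["501 - 1000", "unknown", "10000+"],
})


def mark_placeholders_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Replace known placeholder values with NaN (hypothetical helper)."""
    out = df.copy()
    # Series.mask() sets entries to NaN wherever the condition is True.
    out["Founded"] = out["Founded"].mask(out["Founded"] == -1)
    # Assumption: the Industry placeholder is the string "-1".
    out["Industry"] = out["Industry"].mask(out["Industry"] == "-1")
    out["Size"] = out["Size"].mask(out["Size"] == "unknown")
    return out


cleaned = mark_placeholders_missing(sample)
print(cleaned.isnull().sum())
```

With the placeholders marked this way, charts and averages would skip those rows instead of treating `-1` as a real founding year.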

In [24]:
df1['Lower Salary'].value_counts()

43     22
65     20
61     18
80     18
52     18
49     18
81     17
74     16
63     16
56     16
86     15
60     15
54     14
42     14
71     13
44     12
100    11
37     11
68     11
110    11
64     10
75     10
50     10
39     10
83     10
76     10
55      9
59      9
108     9
102     9
72      9
85      9
31      8
90      8
40      8
82      8
35      8
48      8
97      7
120     7
69      7
62      7
150     7
107     7
67      6
95      6
91      6
116     6
53      6
32      6
84      6
79      6
45      6
38      5
113     5
92      5
111     5
87      5
66      5
34      5
77      5
105     5
109     5
124     4
58      4
114     4
57      4
94      4
47      4
36      4
93      4
118     4
73      4
190     3
101     3
138     3
127     3
202     3
78      3
70      3
33      3
117     3
41      3
121     3
200     3
89      3
99      3
139     3
20      3
98      2
126     2
27      2
119     2
135     2
106     2
125     2
115     2
158     2
88      2
132     2


In [25]:
df1['Upper Salary'].value_counts()

140    16
119    15
110    15
124    15
113    13
127    13
86     12
173    12
101    12
139    11
85     11
142    10
62     10
97     10
134    10
160    10
123     9
99      9
133     9
112     9
129     8
105     8
149     8
143     8
132     8
126     7
148     7
115     7
78      7
95      7
82      7
96      7
91      7
81      7
172     7
135     7
93      7
144     7
70      6
76      6
100     6
71      6
92      6
68      6
182     6
66      6
106     6
80      6
111     6
137     6
211     6
179     6
158     5
159     5
125     5
130     5
98      5
167     5
157     5
220     5
72      5
109     5
146     5
52      5
114     5
120     5
136     4
121     4
90      4
189     4
208     4
116     4
89      4
102     4
150     4
117     4
175     4
147     4
180     4
176     4
171     4
178     4
161     4
166     4
153     4
224     3
108     3
59      3
155     3
58      3
64      3
199     3
306     3
196     3
88      3
200     3
57      3
194     3
77      3
55      3


In [26]:
df1['Avg Salary(K)'].value_counts()

87.5     12
140.0    11
81.0     11
85.0     10
107.5    10
56.5     10
84.5     10
107.0    10
87.0      9
120.0     9
154.5     8
109.0     8
70.5      8
76.5      8
100.0     7
65.0      7
85.5      7
95.0      7
121.0     7
62.5      7
61.0      7
114.5     7
77.5      7
80.5      7
54.0      6
51.5      6
139.5     6
68.5      6
106.5     6
124.0     6
52.5      6
112.5     6
96.0      6
61.5      6
94.5      6
98.0      5
75.5      5
66.5      5
128.5     5
44.5      5
99.0      5
93.5      5
114.0     5
99.5      5
92.0      5
111.5     5
80.0      5
98.5      5
101.0     5
73.0      5
113.5     5
73.5      5
103.5     5
65.5      5
139.0     5
55.0      4
91.5      4
124.5     4
161.5     4
64.0      4
110.5     4
71.5      4
162.0     4
72.5      4
100.5     4
138.5     4
147.0     4
117.5     4
48.5      4
86.5      4
109.5     4
130.0     4
97.5      4
84.0      4
105.5     4
142.5     4
115.0     4
90.0      4
69.5      4
173.0     3
155.0     3
142.0     3
254.0     3
169.

I had some trouble loading this cleaned data into Tableau! Surprisingly, Tableau wasn't recognizing the columns; instead, it was grabbing seemingly random sections of text.
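
One possible cause, which is an assumption and was not verified here, is that the free-text `Job Description` column contains embedded newlines (the `\n` sequences visible in the `df1.head()` output), which some CSV importers may read as row breaks. The sketch below collapses whitespace in text cells and writes a quoted CSV. `flatten_text_columns` is a hypothetical helper, and the sample row is illustrative.

```python
import csv
import io
import re

import pandas as pd


def flatten_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse whitespace runs (including newlines) in every string cell."""
    def flatten(value):
        if isinstance(value, str):
            return re.sub(r"\s+", " ", value).strip()
        return value  # leave numbers and missing values untouched

    out = df.copy()
    for col in out.columns:
        out[col] = out[col].map(flatten)
    return out


# Illustrative row with an embedded newline in the description.
sample = pd.DataFrame({
    "Job Title": ["Data Scientist"],
    "Job Description": ["Data Scientist\nLocation: Albuquerque, NM"],
    "Avg Salary(K)": [72.0],
})

flat = flatten_text_columns(sample)

# Write to an in-memory buffer so the sketch needs no files;
# QUOTE_NONNUMERIC wraps every text field in double quotes.
buffer = io.StringIO()
flat.to_csv(buffer, index=False, quoting=csv.QUOTE_NONNUMERIC)
print(buffer.getvalue())
```

In the notebook, the equivalent step would be `flatten_text_columns(df1).to_csv("data_for_tableau.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)`, where the output filename is a placeholder. Dropping `Job Description` entirely before export is another option if the column is not needed for the charts.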

# V. Conclusion

This assignment was a bit difficult! Learning brand-new software while also working with a dataset I had to pick and clean myself made it hard to get everything perfectly correct. I will continue working with Tableau, though! I'm sure this is just the beginning of my Tableau career.

Thank you!
Jeremy

# VI. References

1) From the Experts PDF, Week 8

2) Tableau Tutorial. (2020). Tutorials Point. Retrieved May 1, 2022, from https://www.tutorialspoint.com/tableau/

3) Voter registration dataset. FiveThirtyEight. https://github.com/fivethirtyeight/data/tree/master/voter-registration

4) Data scientist salaries dataset. Kaggle. https://www.kaggle.com/datasets/nikhilbhathi/data-scientist-salary-us-glassdoor