# Case Study Overview

Target Variable
-------

In the world of machine learning, the target variable is defined as the variable or column in a dataset whose value is to be predicted or analysed by using the other variables in the same dataset. For our case study, can you guess which of the following is the target variable?

    App

    Size

    Content Rating

    Rating

`Rating` ✓ Correct

Feedback:

You want to analyse the data set to find out the features that determine whether an app is performing well or not in ratings. Therefore, the Rating column is our target variable. You’ll be analysing the way the rating varies across different categories of other variables to determine the most important indicators for the high-performing apps.

Frequency of Mode
---------

Once you impute the missing values for `Current Ver` as mentioned towards the end of the above video, answer the following question. If you have difficulty doing so, check the feedback.

After the imputation step, how many values in `Current Ver` are of the type “Varies with Device”?
 

    1415

    1418

    1422

    1419


`1419` ✓ Correct

Feedback:

Use the following code to replace the null values:

    inp1['Current Ver'] = inp1['Current Ver'].fillna(inp1['Current Ver'].mode()[0])
    inp1['Current Ver'].isnull().sum()

    # After that, call value_counts() to count each distinct value
    inp1['Current Ver'].value_counts()

You’ll see that a total of 1419 values are 'Varies with Device'.
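
To see why the count for the mode goes up after imputation, here is a minimal sketch on a tiny, made-up Series standing in for `inp1['Current Ver']`. The values below are hypothetical and are not taken from the case-study data.

```python
import pandas as pd

# Hypothetical stand-in for inp1['Current Ver'] with two missing values.
current_ver = pd.Series(
    ["Varies with device", "1.0", None, "Varies with device", "2.1", None]
)

most_common = current_ver.mode()[0]         # most frequent non-null value
imputed = current_ver.fillna(most_common)   # fill the nulls with the mode

missing_after = int(imputed.isnull().sum())
mode_count = int(imputed.value_counts()[most_common])
print(most_common, missing_after, mode_count)
```

Every null that gets filled adds one to the count of the mode, which is why the question asks for the count after the imputation step.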

[Run Code in this Cell](https://colab.research.google.com/drive/1hYabr8ogTPWFJ3rSM-AOBF3ZX6-FJDFd?authuser=1#scrollTo=e_ilT9u4DxFA)

Mean
------

What is the average Price for all the apps which have the Android version as “4.1 and up”?


    $12.11

    $0.56

    $0.67

    Some error comes up while calculating the value


`$0.67` ✓ Correct

Feedback:

Observe that Price has the object datatype. If you try to compute the mean directly, without first converting the column to a float, you will get an error. Once the column is converted, the mean for these apps comes out to $0.67.
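
The conversion step can be sketched as follows. This is only an illustration on made-up rows: it assumes that paid apps store the price as a string with a leading `$` (for example `"$1.99"`), which is an assumption about the raw format rather than something stated above.

```python
import pandas as pd

# Hypothetical rows; the real inp1 has many more columns and records.
df = pd.DataFrame({
    "Android Ver": ["4.1 and up", "4.1 and up", "4.0 and up", "4.1 and up"],
    "Price": ["0", "$1.99", "$4.99", "0"],   # object dtype, as in the exercise
})

# Strip the assumed leading "$" and convert the column to float.
df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(float)

# Average price for one Android version.
mean_price = df.loc[df["Android Ver"] == "4.1 and up", "Price"].mean()
print(round(mean_price, 2))
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.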

Fixing the Installs Column
-----------

After removing the additional symbols in the Installs column, calculate the approximate number of installs at the 50th percentile.

    [Hint - You can use the replace() function here.]

      100,000
      500,000
      10,000
      1,000,000

------------

      500,000

      ✓ Correct
          Feedback:
      Please check the video solution.
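
If you'd like to try it before watching the video, the hint can be sketched like this. The sample values are hypothetical, and the raw format (commas and a trailing `+`, as in `"10,000+"`) is an assumption about what the "additional symbols" are.

```python
import pandas as pd

# Hypothetical values in the assumed raw format, e.g. "10,000+".
installs = pd.Series(["10,000+", "500,000+", "1,000,000+", "100+", "5,000,000+"])

# Remove the commas and the plus sign, then convert to integers.
cleaned = (
    installs.str.replace(",", "", regex=False)
            .str.replace("+", "", regex=False)
            .astype(int)
)

median_installs = cleaned.quantile(0.5)   # value at the 50th percentile
print(median_installs)
```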

Box Plots
------

Since you’re already familiar with box plots, can you tell what is represented by the thick black line in the middle of a box plot?

<br>
<center>

![ss](https://images.upgrad.com/72b29527-3f87-4efc-b13a-e5ca80350886-Box%20Plots.JPG)

</center>
<br>

-----------

Options

    Mean
    Mode
    Median
    Standard Deviation

---------------

    Median

    ✓ Correct
      Feedback:

    The line in the middle of a box plot represents the median value.

    Let’s now hear from Rahim as he explains the other values that 
    are represented by the elements of a box plot.

Box Plots
------

For the given dataset, calculate the IQR of the Price column.


    0

    10

    14

    400

------------------

`0` ✓ Correct

Feedback:

You can either draw a box plot or use the describe function here. If you run

    inp1['Price'].describe()

you will see that the values at both the 25th and the 75th percentile are 0. Hence, the IQR is also 0.
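
A minimal sketch of the same calculation on made-up prices, where most apps are free, shows how the IQR collapses to 0. The numbers are hypothetical.

```python
import pandas as pd

# Hypothetical prices: 18 free apps and 2 paid ones.
price = pd.Series([0.0] * 18 + [2.99, 399.99])

summary = price.describe()
iqr = summary["75%"] - summary["25%"]   # IQR = Q3 - Q1
print(summary["25%"], summary["75%"], iqr)
```

When more than three quarters of the values are identical, both quartiles land on that value, so the box in the box plot has no height.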

Histograms
-----------

Plot a histogram for the Reviews column again and choose the correct option:

----------

    The peak is now towards the end of the histogram

    The peak is still at the beginning of the histogram

    There are two peaks now.

    None of the above

-----------

    The peak is still at the beginning of the histogram

    ✓ Correct
      Feedback:

Please check the video solution if you're facing any difficulty.

Use the following command and you can observe that the peak is still at the beginning.

    plt.hist(inp1.Reviews)

Analysing the Installs Column
-------------

Calculate the IQR of the Installs column.

    9.9 ∗ 10^5

    9.9 ∗ 10^3

    9.9 ∗ 10^4

    None of the above

-----------

`9.9 ∗ 10^5` ✓ Correct

    Feedback:

Use either a box plot or the describe() function to solve this.

The code would be

    inp1.Installs.describe()

After that, calculate the IQR by subtracting the 25th percentile value from the 75th percentile value, and you'll get the answer. You can also compute the two percentiles directly:

    a = np.percentile(inp1.Installs, 75)
    b = np.percentile(inp1.Installs, 25)

    a, b, a-b
    # (1000000.0, 10000.0, 990000.0)

Analysing the Installs Column
----------

Now, remove all the apps which have the number of installs greater than 100 million. 

After that, evaluate the shape of the data and choose the correct option.

---------------

    The resulting dataframe has 7345 records remaining.

    The resulting dataframe has 8645 records remaining.

    The resulting dataframe has 8624 records remaining.

    The resulting dataframe has 7324 records remaining.

------------

    The resulting dataframe has 8624 records remaining.

    ✓ Correct
    Feedback:

Please check the video solution if you're facing any difficulty.

Filter the rows with a condition and then check the shape attribute. Use the following code:

    inp1 = inp1[inp1.Installs <= 100000000]
    inp1.shape

The final shape comes out to (8624, 13).

Analysing the Size Column
----------

Plot a histogram for the Size column and then choose the correct option.

--------------

    A majority of apps have a size less than 30,000.

    A majority of apps have a size more than 30,000.

    A majority of apps have a size more than 40,000.

----------------

    A majority of apps have a size less than 30,000.

    ✓ Correct
      Feedback:

Plot a histogram using either matplotlib or the pandas functionality.

    plt.hist(inp1.Size)

Once you create the histogram, it is clearly visible that the first three peaks outweigh the rest of the bars, and hence you can say that a majority of apps have a size less than 30,000.

Analysing the Size column
-----------

Analyse the Size column using a box plot and report back the approximate median value.


    12,000

    26,000

    14,000

    18,000

-----------

    18,000

    ✓ Correct
    Feedback:

Create a box plot using either the matplotlib or the pandas functionality. Use the following code:

    plt.boxplot(inp1.Size)
    plt.show()
 
Once it is done, you can see that the median value lies at around 18,000.

Also, check the video solution for understanding a bit more about why you should keep the outliers here and not remove them.

Histograms vs Bar Plots
-----------

You have already studied bar plots in the previous module. It is a common mistake to confuse them with histograms. To understand the difference, try analysing the following two situations and then choose the correct option:


`Situation A` - You want to visualise the total number of runs scored by MS Dhoni in a single year against all the teams he has played against.

`Situation B` - You want to visualise the spread of the runs scored by MS Dhoni in a single year.


    Both situations require a histogram.

    Both situations require a bar plot.

    Situation A requires a bar plot whereas Situation B requires a histogram.

    Situation A requires a histogram whereas Situation B requires a bar plot.

----------------

    Situation A requires a bar plot whereas Situation B requires a histogram.

    ✓ Correct
      Feedback:

`A Histogram` plots the frequency of a `numeric variable`, whereas the `Bar plot` shows the aggregation of a certain numerical entity for some `categorical variable`.

-----------

`In Situation A`, you are analysing the total runs (a numeric variable) for each team (a categorical variable). Hence, it needs a bar plot.

`For Situation B`, you are examining the spread of a numeric variable by checking its frequency. Hence, a histogram is used here.
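
The difference also shows up in the data you would feed to each plot. Here is a rough sketch on a hypothetical innings table (the opponents and scores are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical innings data, for illustration only.
innings = pd.DataFrame({
    "opponent": ["Australia", "England", "Australia",
                 "Sri Lanka", "England", "Sri Lanka"],
    "runs": [45, 10, 72, 33, 58, 5],
})

# Situation A: one aggregated number per category -> input for a bar plot.
runs_per_team = innings.groupby("opponent")["runs"].sum()

# Situation B: how many innings fall in each run range -> input for a histogram.
counts, edges = np.histogram(innings["runs"], bins=[0, 25, 50, 75])

print(runs_per_team)
print(counts, edges)
```

In the bar plot the x-axis holds team names, while in the histogram the x-axis holds ranges of the numeric variable itself.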


Distplot
-----------

If you want a view like the one shown below for the Rating column, the corresponding code that you would need to add would be?

<br>

![ss](https://images.upgrad.com/ca32c363-add5-4b43-b2d1-773d49375d1f-image22.png)

    sns.distplot(inp1.Rating,rug = False)

    sns.distplot(inp1.Rating,kde = True)

    sns.distplot(inp1.Rating,rug = True, fit = norm)

    sns.distplot(inp1.Rating,kde = False)

------------

    sns.distplot(inp1.Rating,kde = False)

    ✓ Correct
      Feedback:

The `kde` parameter in distplot controls whether a Gaussian kernel density estimate is drawn on top of the histogram. By default, it is set to True.

Hence, setting `kde` to False produces only the histogram shown above.

Distplot bins
----------

Observe that there are certain gaps in the distplot view that we have shown above. 

This is because the number of bins created is quite high and hence some bins/buckets have no density at all. 

Now, you wish to set the number of bins to 15 to remove those gaps. 

Which of the following distplots shows the number of bins set to 15?

<br>

![ss](https://images.upgrad.com/9bbdc26d-54ff-4b72-8c7f-e4e4b370c01d-dataviz1.JPG)

    A
    B
    C
    None of the above

-----------

`B` ✓ Correct
    Feedback:

    Set the bins parameter to 15 in sns.distplot() and verify the images. 
    The code would be sns.distplot(inp1.Rating, bins=15)
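
To see why too many bins leave gaps, here is a small sketch using NumPy only. It assumes, for illustration, that ratings are recorded to one decimal place between 1.0 and 5.0; the values are made up and the bin count of 50 is just an example of a "high" number.

```python
import numpy as np

# Hypothetical ratings on a 0.1 grid from 1.0 to 5.0 (41 distinct values).
ratings = np.round(np.arange(10, 51) / 10, 1)

counts_50, _ = np.histogram(ratings, bins=50)   # more bins than distinct values
counts_15, _ = np.histogram(ratings, bins=15)   # wider bins

empty_50 = int((counts_50 == 0).sum())
empty_15 = int((counts_15 == 0).sum())
print(empty_50, empty_15)
```

With 50 bins there are more buckets than distinct values, so some buckets must stay empty. With 15 bins each bucket is wide enough to catch at least one value.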

Barplot
---------

Plot a bar plot for apps belonging to different types of ‘Android Ver’ and report back the category at the 4th highest peak.


    4.1 and up

    4.0 and up

    Varies with device
    
    4.0.3 and up

-----

    Varies with device

    ✓ Correct
      Feedback:

Plot a bar plot with the following code: 

`inp1['Android Ver'].value_counts().plot.bar()`

You can see that at the 4th highest peak you have the ‘Varies with device’ category.
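
Since `value_counts()` sorts from most to least frequent, you can also read the 4th entry directly instead of judging bar heights by eye. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical 'Android Ver' values, for illustration only.
android_ver = pd.Series(
    ["4.1 and up"] * 5 + ["4.0.3 and up"] * 4 + ["4.0 and up"] * 3
    + ["Varies with device"] * 2 + ["2.3 and up"]
)

counts = android_ver.value_counts()    # sorted in descending order of frequency
fourth_most_common = counts.index[3]   # position 3 is the 4th highest bar
print(fourth_most_common, counts.iloc[3])
```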

Jointplot
------------

In case you want to remove the histogram/distribution plot appearing on the jointplot’s axes, which parameter do you need to set?

    hist = False

    dist = False

    kde = False

    None of the above.

-------------

    None of the above.

    ✓ Correct
      Feedback:

You cannot remove the distribution plot from the Jointplot. 

In case you don’t want it, you can always use `pyplot.scatter()` or `sns.scatterplot()` to plot the same variables.

Estimator
--------

Change the estimator function in the graph above to analyse minimum Rating for each of the different categories of ‘Content Rating’. 

Which category has the highest minimum rating?


    Everyone

    Mature 17+

    Teen

    Everyone 10+

--------------

    Teen

    ✓ Correct
      Feedback:

Change the estimator function to np.min and plot using

    sns.barplot(data=inp1, x="Content Rating", y="Rating", estimator = np.min)
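
You can cross-check what the bar plot shows with a plain groupby. This is a sketch on invented rows, not the actual dataset:

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df = pd.DataFrame({
    "Content Rating": ["Everyone", "Everyone", "Teen",
                       "Teen", "Mature 17+", "Mature 17+"],
    "Rating": [1.0, 4.5, 2.0, 4.2, 1.5, 4.0],
})

# Minimum rating per category: the same number each bar would show.
min_rating = df.groupby("Content Rating")["Rating"].min()
highest_min = min_rating.idxmax()   # category with the highest minimum
print(min_rating)
print(highest_min)
```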


Capping
------

Plot a boxplot for the Rating column. The lower fence gets capped between

    2.0 - 2.5

    2.5 - 3.0

    3.0 - 3.5

    4.0 - 4.5

------------

    3.0 - 3.5

    ✓ Correct
      Feedback:

Plot a box plot with `sns.boxplot(inp1.Rating)`

You’ll observe that the `lower fence is between 3.0 - 3.5`.
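
To check the position of the lower whisker numerically, here is a minimal sketch. It assumes the default whisker rule used by matplotlib and seaborn box plots (the whisker stops at the lowest data point that is at least Q1 minus 1.5 times the IQR). The helper name `lower_whisker` and the ratings are hypothetical.

```python
import numpy as np

def lower_whisker(values, whis=1.5):
    """Lowest data point that is still within Q1 - whis * IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    fence = q1 - whis * (q3 - q1)
    return values[values >= fence].min()

# Hypothetical ratings; 1.0 lies below the fence and is drawn as an outlier.
ratings = [1.0, 3.4, 3.8, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.7]
print(lower_whisker(ratings))
```

Points below the fence are plotted individually as outliers, which is why the whisker does not stretch all the way down to the minimum rating.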

Lower Fence
------------

For the 4 most popular Genres, plot a box plot and report back the 
Genre having the highest Rating at the lower fence.
 

    [Hint: For finding the top 4 most popular Genres, 
    you may use the value_counts() function. 

    After that subset the dataframe to only contain the data 
    for these specific Genre types]

--------

    Tools

    Medical

    Education

    Entertainment

-------------

    Education

    ✓ Correct
      Feedback:

First, you need to find the 4 most popular Genres. 

This can be done by the following code

    inp1['Genres'].value_counts()

This will give you the top 4 Genres: Tools, Entertainment, Medical and Education.

Take all the rows having only these as the values of Genres.

    c = ['Tools', 'Entertainment', 'Medical', 'Education']
    inp5 = inp1[inp1['Genres'].isin(c)]

Finally, plot a box plot using the subset for both axes:

    sns.boxplot(x=inp5['Genres'], y=inp5['Rating'])

You can observe that the highest value at the lower fence occurs for the ‘Education’ Genre.
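
Instead of typing the top 4 Genres by hand, you could also derive them from `value_counts()`. The sketch below uses a small, made-up dataframe; the variable names `top4` and `subset` are just for illustration.

```python
import pandas as pd

# Hypothetical data: 'Puzzle' is the least common Genre here.
df = pd.DataFrame({
    "Genres": ["Tools"] * 5 + ["Entertainment"] * 4 + ["Medical"] * 3
              + ["Education"] * 2 + ["Puzzle"],
    "Rating": [4.0, 3.5, 4.2, 3.9, 4.1, 4.3, 3.8, 4.0, 4.4,
               4.1, 3.7, 4.5, 4.6, 4.4, 3.0],
})

# The four most frequent Genres, taken from the sorted value counts.
top4 = df["Genres"].value_counts().head(4).index

# Keep only the rows that belong to those Genres.
subset = df[df["Genres"].isin(top4)]
print(list(top4), subset.shape)
```

The resulting `subset` can then be passed to the box plot in the same way as `inp5` above.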