# Activity 2

**Objective:** To use data visualization as an exploratory tool for the data

In this activity, I will continue exploring the [Mountains dataset](https://www.kaggle.com/abcsds/highest-mountains).

As a reminder, I will bring up the dataset information again to set the stage for the analysis.

In [6]:
import pandas as pd

# Import the data from CSV as a pandas DataFrame object
mountains = pd.read_csv('../datasets/Mountains.csv')
# Show the first 5 rows of the dataframe
mountains.head()

| Unnamed: 0 | Rank | Mountain | Height (m) | Height (ft) | Prominence (m) | Range | Coordinates | Parent mountain | First ascent | Ascents bef. 2004 | Failed attempts bef. 2004 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Mount Everest / Sagarmatha / Chomolungma | 8848 | 29029 | 8848 | Mahalangur Himalaya | 27°59′17″N 86°55′31″E | | 1953 | >>145 | 121.0 |
| 1 | 2 | K2 / Qogir / Godwin Austen | 8611 | 28251 | 4017 | Baltoro Karakoram | 35°52′53″N 76°30′48″E | Mount Everest | 1954 | 45 | 44.0 |
| 2 | 3 | Kangchenjunga | 8586 | 28169 | 3922 | Kangchenjunga Himalaya | 27°42′12″N 88°08′51″E | Mount Everest | 1955 | 38 | 24.0 |
| 3 | 4 | Lhotse | 8516 | 27940 | 610 | Mahalangur Himalaya | 27°57′42″N 86°55′59″E | Mount Everest | 1956 | 26 | 26.0 |
| 4 | 5 | Makalu | 8485 | 27838 | 2386 | Mahalangur Himalaya | 27°53′23″N 87°05′20″E | Mount Everest | 1955 | 45 | 52.0 |


We can see that there is some interesting information about the mountains related to their geography with a mix of mountaineering history -- which is great! This is my way to learn more about the greatest mountain ranges in the world, because as a climber, that is something I should know, right? 

My work in this activity will focus on answering the following questions:
1. What is the range with the highest mountains?
2. Is there a correlation between the number of successful ascents and the height of the mountains?
3. In which range do we find the mountains with more failed attempts than actual ascents?
4. Is there a correlation between the Prominence and the success of an expedition?
5. Which is the most remote mountain in the dataset?

Some of the questions in the list are kind of obvious (at least to me), but I will go ahead with the exercise just to practice my coding, and who knows, I might be surprised...

## Housekeeping

In the previous activity I missed an important part of data cleaning. I was naive and assumed that the data was completely clean. However, upon revising the results I presented [here](), I noticed that the actual rank of the mountain **Distaghil Sar** is not 20, as I said it was, but 19. This means it should be located in row 18 (zero-indexing) of the data frame, right? In reality it is located in row 19, which doesn't make much sense. Here is a snapshot of the first 21 rows of the dataset. I have included only the columns **Rank**, **Mountain** and **Height (m)** for the sake of clarity.

In [13]:
print(mountains.loc[0:20][["Rank","Mountain","Height (m)"]])

    Rank                                  Mountain  Height (m)
0      1  Mount Everest / Sagarmatha / Chomolungma        8848
1      2                K2 / Qogir / Godwin Austen        8611
2      3                             Kangchenjunga        8586
3      4                                    Lhotse        8516
4      5                                    Makalu        8485
5      6                                   Cho Oyu        8188
6      7                              Dhaulagiri I        8167
7      8                                   Manaslu        8163
8      9                              Nanga Parbat        8126
9     10                               Annapurna I        8091
10    11           Gasherbrum I / Hidden Peak / K5        8080
11    12                           Broad Peak / K3        8051
12    13                        Gasherbrum II / K4        8035
13    14                              Shishapangma        8027
14    15                             Gyachung Kang     

Take a look at row 15: the rank assigned to the mountain **Gasherbrum III** is 110, when it should have been 16. Besides being wrong itself, this rank shifts the ranks of all the rows that follow. This is not the only misplaced data point in the dataset:

In [14]:
print(mountains.loc[20:23][["Rank","Mountain","Height (m)"]])

    Rank         Mountain  Height (m)
20    20      Ngadi Chuli        7871
21   111           Nuptse        7864
22    21  Khunyang Chhish        7823
23    22  Masherbrum / K1        7821


Here, the mountain **Nuptse** is ranked 111, when it should be 22 (it sits in row 21 with zero-indexing). There are more data points like these two. So, I propose the following data cleaning:

1. Create a list that starts at 1, increments by 1 and ends at the length of the dataset. This list will be named `new_rank` and will be used to replace the data in the **Rank** column of the dataframe.
2. Just to make sure everything is in the right order, I will sort the data in descending order by the column **Height (m)**.
3. Replace the current data in **Rank** with the list created in step 1.
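
Before overwriting the column, it can help to list every row whose stored rank disagrees with its position. The sketch below does that on a small stand-in table; the name `toy` and all of its values are hypothetical and only mimic the situation described above, assuming the rows are already in descending height order.

```python
import pandas as pd

# Hypothetical stand-in for the Mountains data, already in descending height order.
# One rank (110) is misplaced, which also shifts the ranks of the rows after it.
toy = pd.DataFrame({
    "Rank": [1, 2, 110, 3, 4],
    "Mountain": ["Peak A", "Peak B", "Peak C", "Peak D", "Peak E"],
    "Height (m)": [8800, 8600, 8500, 8400, 8300],
})

# The rank each row should have given its position: 1, 2, 3, ...
expected_rank = pd.Series(range(1, len(toy) + 1), index=toy.index)

# Keep only the rows whose stored rank differs from the expected one
mismatched = toy[toy["Rank"] != expected_rank]
print(mismatched)
```

On the toy table this flags the misplaced row and every row after it, which is the same shift seen in the real data.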

In [38]:
# Step 1 -- create new Rank list
dim = mountains.shape
new_rank = range(1,dim[0]+1)

In [39]:
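# Step 2 -- Sort data in descending order using the Height values.
# sort_values returns a new DataFrame, so the result is assigned back;
# reset_index keeps the row labels aligned with the new order
mountains = mountains.sort_values(by="Height (m)", ascending=False).reset_index(drop=True)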

# Just to make sure, let's look at a random chunk of data. 
print(mountains.loc[90:102][["Rank","Mountain","Height (m)"]])

     Rank                     Mountain  Height (m)
90     91                   Siguang Ri        7309
91     92       The Crown / Huang Guan        7295
92     93                   Gyala Peri        7294
93     94                    Porong Ri        7292
94     95     Baintha Brakk / The Ogre        7285
95     96                  Yutmaru Sar        7283
96     97          Baltistan Peak / K6        7282
97     98  Kangpenqing / Gang Benchhen        7281
98     99                Muztagh Tower        7276
99    100                    Mana Peak        7272
100   101                Dhaulagiri VI        7268
101   102                        Diran        7266
102   103      Labuche Kang III / East        7250


In [40]:
# Step 3 -- Replace the values in the Rank column with the correct rank order (new_rank).
# Assigning to the column directly avoids chained indexing, which may write to a copy
mountains["Rank"] = list(new_rank)

# Check
print(mountains.loc[90:102][["Rank","Mountain","Height (m)"]])

     Rank                     Mountain  Height (m)
90     91                   Siguang Ri        7309
91     92       The Crown / Huang Guan        7295
92     93                   Gyala Peri        7294
93     94                    Porong Ri        7292
94     95     Baintha Brakk / The Ogre        7285
95     96                  Yutmaru Sar        7283
96     97          Baltistan Peak / K6        7282
97     98  Kangpenqing / Gang Benchhen        7281
98     99                Muztagh Tower        7276
99    100                    Mana Peak        7272
100   101                Dhaulagiri VI        7268
101   102                        Diran        7266
102   103      Labuche Kang III / East        7250


If we compare the output of step 3 with that of step 2, we can check that the ranks in this chunk follow the row order. The rank order is now consistent.
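
Looking at one chunk of rows only checks that chunk. A programmatic check over the whole table is a possible complement. The sketch below is one way to write it; the helper `rank_is_consistent` and the `toy` table are hypothetical names and values introduced here for illustration, not part of the original analysis.

```python
import pandas as pd

def rank_is_consistent(df):
    """True when Rank counts 1..n and heights never increase down the rows."""
    expected = list(range(1, len(df) + 1))
    ranks_ok = df["Rank"].tolist() == expected
    heights_ok = bool(df["Height (m)"].is_monotonic_decreasing)
    return ranks_ok and heights_ok

# Hypothetical stand-in table with one misplaced rank
toy = pd.DataFrame({
    "Rank": [1, 2, 110, 3],
    "Mountain": ["Peak A", "Peak B", "Peak C", "Peak D"],
    "Height (m)": [8800, 8600, 8500, 8400],
})

# Same three cleaning steps as above, applied to the toy table
fixed = toy.sort_values(by="Height (m)", ascending=False).reset_index(drop=True)
fixed["Rank"] = list(range(1, len(fixed) + 1))

print(rank_is_consistent(toy))    # before cleaning
print(rank_is_consistent(fixed))  # after cleaning
```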

## Data visualization

In this section we will try to answer the questions presented at the beginning using data visualization with `matplotlib`.

**What is the range with the highest mountains?**

One way to find out is to slice the data per range and find the lowest rank in each group. 
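
A minimal sketch of that idea, assuming the cleaned **Rank** column and the **Range** column shown earlier. The `toy` table and its range names are hypothetical, so the result says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in: five ranked mountains spread over three ranges
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Mountain": ["Peak A", "Peak B", "Peak C", "Peak D", "Peak E"],
    "Range": ["Range X", "Range Y", "Range X", "Range Z", "Range Y"],
})

# Group the rows by range and keep the lowest (best) rank found in each group,
# then order the ranges so the one holding the highest mountain comes first
best_rank_per_range = toy.groupby("Range")["Rank"].min().sort_values()
print(best_rank_per_range)
```

The resulting Series is indexed by range name, so it can be passed to a `matplotlib` bar chart if a visual comparison is wanted.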