# Activity 01
**Objective:** Find, import and explore a dataset.

I have chosen a dataset with information about the world's tallest [mountains](https://www.kaggle.com/abcsds/highest-mountains).

**Data description from the source:** 

Wikipedia's list of the world's highest mountains. Shared as stated on Wikipedia's webpage under the Creative Commons Attribution-ShareAlike License.

The columns are:

* Rank: Position in the rank of highest mountains in the world. (Integer)<br>
* Mountain: Name or names of the mountain. (Text separated by slashes)<br>
* Height (m): Height in meters of the mountain. (Integer)<br>
* Height (ft): Height in feet of the mountain. (Integer)<br>
* Prominence (m): [Topographic prominence](https://en.wikipedia.org/wiki/Topographic_prominence) of the mountain in meters. (Integer)<br>
* Range: What mountain range does this mountain belong to. (Text)<br>
* Coordinates: Coordinates of the highest point. (Text in [WGS84](https://en.wikipedia.org/wiki/World_Geodetic_System) format)<br>
* Parent mountain: [Line parent](https://en.wikipedia.org/wiki/Line_parent) of the mountain. (Text)<br>
* First ascent: Year of first ascent. (Year-formatted Integer)<br>
* Ascents bef. 2004: Number of successful ascents before 2004. (Integer)<br>
* Failed attempts bef. 2004: Number of unsuccessful ascents before 2004. (Integer)


## Summary of the dataset

As the code below shows, this is a small dataset with **118** rows and **11** columns. It also mixes data types: one float column, four integer columns, and six `object` columns (text such as the coordinates).

The method `head()` shows the first rows of the DataFrame; the default number of rows is five. The data are sorted in descending order of height, so Everest comes first. The preview also shows that four of the five highest mountains are in Himalayan ranges; the exception, K2, is in the Karakoram. The method `info()` summarizes the DataFrame: the number of rows and columns, and the non-null count and data type of each column.

In [33]:
import pandas as pd

# Import the data from CSV as a pandas DataFrame object
mountains = pd.read_csv('../datasets/Mountains.csv')

# shape is a (rows, columns) tuple
dim = mountains.shape
print("The total number of rows and columns in the dataset is " + str(dim[0]) + " and " + str(dim[1]) + ", respectively")

# Show the first 5 rows of the dataframe
mountains.head()


The total number of rows and columns in the dataset is 118 and 11, respectively


| | Rank | Mountain | Height (m) | Height (ft) | Prominence (m) | Range | Coordinates | Parent mountain | First ascent | Ascents bef. 2004 | Failed attempts bef. 2004 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Mount Everest / Sagarmatha / Chomolungma | 8848 | 29029 | 8848 | Mahalangur Himalaya | 27°59′17″N 86°55′31″E | | 1953 | >>145 | 121.0 |
| 1 | 2 | K2 / Qogir / Godwin Austen | 8611 | 28251 | 4017 | Baltoro Karakoram | 35°52′53″N 76°30′48″E | Mount Everest | 1954 | 45 | 44.0 |
| 2 | 3 | Kangchenjunga | 8586 | 28169 | 3922 | Kangchenjunga Himalaya | 27°42′12″N 88°08′51″E | Mount Everest | 1955 | 38 | 24.0 |
| 3 | 4 | Lhotse | 8516 | 27940 | 610 | Mahalangur Himalaya | 27°57′42″N 86°55′59″E | Mount Everest | 1956 | 26 | 26.0 |
| 4 | 5 | Makalu | 8485 | 27838 | 2386 | Mahalangur Himalaya | 27°53′23″N 87°05′20″E | Mount Everest | 1955 | 45 | 52.0 |


In [13]:
# info() prints its summary directly and returns None, so no print() is needed
mountains.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 11 columns):
Rank                         118 non-null int64
Mountain                     118 non-null object
Height (m)                   118 non-null int64
Height (ft)                  118 non-null int64
Prominence (m)               118 non-null int64
Range                        118 non-null object
Coordinates                  118 non-null object
Parent mountain              117 non-null object
First ascent                 118 non-null object
Ascents bef. 2004            116 non-null object
Failed attempts bef. 2004    115 non-null float64
dtypes: float64(1), int64(4), object(6)
memory usage: 10.2+ KB
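
The summary lists *Ascents bef. 2004* as `object` even though it holds counts, because it contains text entries such as the `>>145` shown for Everest. The following is a minimal sketch of one way to obtain numeric values, assuming that entries which cannot be parsed may be treated as missing. The small frame `sample` is a hand-copied stand-in for the first three rows, not the real file.

```python
import pandas as pd

# Hand-copied stand-in for the first three rows shown by head()
sample = pd.DataFrame({
    "Mountain": ["Mount Everest / Sagarmatha / Chomolungma",
                 "K2 / Qogir / Godwin Austen",
                 "Kangchenjunga"],
    "Ascents bef. 2004": [">>145", "45", "38"],
})

# The column is stored as text because of the ">>145" entry
print(sample["Ascents bef. 2004"].dtype)

# errors="coerce" turns entries that cannot be parsed into NaN
ascents = pd.to_numeric(sample["Ascents bef. 2004"], errors="coerce")
print(ascents)
```

Coercing loses the information that Everest had more than 145 ascents, so whether this is acceptable depends on the analysis.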


## Extracting specific data points

To start, we can extract a single row to inspect its contents. The indexer `.loc[]` selects rows by index label; a label that is not in the index raises a `KeyError`. For example, let's look at the 20th row, that is, the 20th highest peak in the list:

In [36]:
# Notice that the rows are 0 indexed, hence the 20th row is indexed as 19
row20 = mountains.loc[19]
print(row20)

Rank                                              19
Mountain                               Distaghil Sar
Height (m)                                      7884
Height (ft)                                    25866
Prominence (m)                                  2525
Range                               Hispar Karakoram
Coordinates                  36°19′33″N 75°11′16″E﻿ 
Parent mountain                                   K2
First ascent                                    1960
Ascents bef. 2004                                  3
Failed attempts bef. 2004                          5
Name: 19, dtype: object


The indexer `.loc[]` is convenient because it returns a Series with the data for the specified row, labeled by column name, which makes it easy to read. Note that the **Rank** of this row is 19 rather than 20: the Rank column does not always match the row position. In this particular example, we can see that even though the first ascent was in 1960, there had been only 8 attempts before 2004, of which 3 were successful. At 7884 m this is a pretty high mountain in a remote location. <br>
Similar to indexing Python lists, we can extract multiple rows by choosing a range of rows `[n:m]` or a list of rows `[n,m,k]`. Unlike a list slice, a `.loc` range includes its end label `m`:

In [26]:
rows20_23 = mountains.loc[19:22]
print(rows20_23)

    Rank         Mountain  Height (m)  Height (ft)  Prominence (m)  \
19    19    Distaghil Sar        7884        25866            2525   
20    20      Ngadi Chuli        7871        25823            1020   
21   111           Nuptse        7864        25801             319   
22    21  Khunyang Chhish        7823        25666            1765   

                  Range              Coordinates Parent mountain First ascent  \
19     Hispar Karakoram  36°19′33″N 75°11′16″E﻿               K2         1960   
20     Manaslu Himalaya  28°30′12″N 84°34′00″E﻿          Manaslu         1970   
21  Mahalangur Himalaya  27°58′03″N 86°53′13″E﻿           Lhotse         1961   
22     Hispar Karakoram  36°12′19″N 75°12′28″E﻿    Distaghil Sar         1971   

   Ascents bef. 2004  Failed attempts bef. 2004  
19                 3                        5.0  
20                 2                        6.0  
21                 5                       12.0  
22                 2                       

In [28]:
# Notice that the syntax for indexing non-consecutive rows is a Python list inside loc's brackets
random_rows = mountains.loc[[3,67,83]]
print(random_rows)

    Rank           Mountain  Height (m)  Height (ft)  Prominence (m)  \
3      4             Lhotse        8516        27940             610   
67    62  Yangra / Ganesh I        7422        24350            2352   
83    77      Nangpai Gosum        7350        24114             500   

                  Range              Coordinates Parent mountain First ascent  \
3   Mahalangur Himalaya  27°57′42″N 86°55′59″E﻿    Mount Everest         1956   
67      Ganesh Himalaya  28°23′29″N 85°07′38″E﻿          Manaslu         1955   
83  Mahalangur Himalaya  28°04′24″N 86°36′51″E﻿          Cho Oyu         1996   

   Ascents bef. 2004  Failed attempts bef. 2004  
3                 26                       26.0  
67                 1                        6.0  
83                 3                        1.0  
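
One detail of `.loc` is visible in the output above: the slice `19:22` returned four rows, so the end label is included. The sketch below contrasts this with the position-based indexer `.iloc`, which excludes the end like a Python list. The frame `peaks` is a toy stand-in with shortened mountain names and heights copied from the `head()` output.

```python
import pandas as pd

# Toy frame with the default 0..4 integer index
peaks = pd.DataFrame({
    "Mountain": ["Mount Everest", "K2", "Kangchenjunga", "Lhotse", "Makalu"],
    "Height (m)": [8848, 8611, 8586, 8516, 8485],
})

by_label = peaks.loc[1:3]      # label slice: rows 1, 2 and 3 (end included)
by_position = peaks.iloc[1:3]  # position slice: rows 1 and 2 (end excluded)
picked = peaks.loc[[0, 4]]     # list of labels: rows 0 and 4

print(len(by_label), len(by_position), len(picked))  # prints: 3 2 2
```

With the default integer index the labels and positions coincide, which is why the difference only shows up at the end of a slice.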


Accessing a column is as easy as using bracket notation with the string corresponding to the column name. For example, to see all the entries in the **Range** column, we write `mountains["Range"]`. pandas returns a Series object that contains the values of that column, along with the index label of each entry. The full output is not shown here because it would take up a large part of the notebook.<br>
We can combine the two previous examples, which show how to address a row and a column, to extract a particular entry in the dataset. Taking the 20th row from the previous example, we could ask: ***where is this mountain located?***

In [31]:
# Row label and column name can be passed to .loc together
coordinates_mount_20 = mountains.loc[19, "Coordinates"]
print(coordinates_mount_20)

36°19′33″N 75°11′16″E﻿ 


The result shows the coordinates of **Distaghil Sar**.
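
Since printing a whole column takes a lot of space, a shorter view can be obtained with `head()` or `value_counts()`. This is a minimal sketch on a toy frame `peaks` whose range names are copied from the first five rows (mountain names shortened); it also shows selecting a single cell with a row label and a column name.

```python
import pandas as pd

peaks = pd.DataFrame({
    "Mountain": ["Mount Everest", "K2", "Kangchenjunga", "Lhotse", "Makalu"],
    "Range": ["Mahalangur Himalaya", "Baltoro Karakoram",
              "Kangchenjunga Himalaya", "Mahalangur Himalaya",
              "Mahalangur Himalaya"],
})

ranges = peaks["Range"]        # a Series: one value per row, plus the index
print(ranges.head(3))          # head() keeps the printout short

cell = peaks.loc[1, "Range"]   # row label and column name in one call
print(cell)                    # prints: Baltoro Karakoram

# value_counts() summarizes a column instead of listing every entry
print(ranges.value_counts())
```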

## Descriptive statistics

To get a [statistical summary](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) of the **numerical** columns in our dataset, we can use the method `.describe()`.



In [39]:
mountains.describe()

| | Rank | Height (m) | Height (ft) | Prominence (m) | Failed attempts bef. 2004 |
|---|---|---|---|---|---|
| count | 118.0 | 118.0 | 118.0 | 118.0 | 115.0 |
| mean | 59.5 | 7578.042373 | 24862.364407 | 1671.567797 | 8.4 |
| std | 34.207699 | 341.471211 | 1120.311905 | 1234.813419 | 15.782958 |
| min | 1.0 | 7200.0 | 23622.0 | 217.0 | 0.0 |
| 25% | 30.25 | 7316.5 | 24004.0 | 712.75 | 1.0 |
| 50% | 59.5 | 7472.5 | 24516.5 | 1332.5 | 3.0 |
| 75% | 88.75 | 7775.5 | 25509.75 | 2297.25 | 11.0 |
| max | 118.0 | 8848.0 | 29029.0 | 8848.0 | 121.0 |


This method skips the non-numeric columns and also excludes NaN values from every statistic. The **count** row shows that 3 values are missing from the column *Failed attempts bef. 2004* (115 of 118). Moreover, the 75th percentile of *Height (m)* tells us that only 25% of the mountains on the list are above 7775.5 m.
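
How `count` reflects missing values can be checked on a small example. The sketch below uses the five failed-attempt counts from the `head()` output plus one artificial NaN, added here only for illustration.

```python
import numpy as np
import pandas as pd

# Five values copied from head(), plus one artificial missing value
failed = pd.Series([121.0, 44.0, 24.0, 26.0, 52.0, np.nan],
                   name="Failed attempts bef. 2004")

summary = failed.describe()
print(summary["count"])     # number of non-missing values
print(failed.isna().sum())  # number of missing values

# quantile(0.75) is the value that describe() reports as "75%"
q75 = failed.quantile(0.75)
print(q75, summary["75%"])
```

Comparing `count` with the total number of rows, or calling `isna().sum()` directly, gives the number of missing entries per column.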