<center><b>© Content is made available under the CC-BY-NC-ND 4.0 license. Christian Lopez, lopezbec@lafayette.edu</b></center>

# Data Understanding: Python Colab Introduction

## 1- Loading dataset from a package


Along with NumPy, [Pandas](https://pandas.pydata.org/) is one of the most widely used Python packages for manipulating data. This is not an introduction to Pandas; if you would like to learn more, see [Python Data Science Handbook Chapter 3](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb).

We will be using the "diamonds" dataset from the [seaborn](https://seaborn.pydata.org/) package.




In [None]:
import seaborn as sns
import pandas as pd

data = sns.load_dataset('diamonds')


This data is already a Pandas Data Frame object.

In [None]:
type(data)

In [None]:
pd.DataFrame.head(data)

But isn't "data" itself a "pandas.DataFrame"? Yes, so we can call the method directly on the object:

In [None]:
data.head()  # The "head" method returns the first 5 rows of the data frame

In [None]:
data.tail()  # The "tail" method returns the last 5 rows of the data frame
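
Both methods accept an optional number of rows. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the diamonds data) to show this. It also shows that calling `head` through the class gives the same result as calling it on the object.

```python
import pandas as pd

# Hypothetical toy data frame, used only for illustration
toy = pd.DataFrame({"carat": [0.23, 0.21, 0.29, 0.31, 0.24, 0.26, 0.22],
                    "price": [326, 326, 334, 335, 336, 337, 337]})

first_two = toy.head(2)    # first 2 rows instead of the default 5
last_three = toy.tail(3)   # last 3 rows instead of the default 5

# Calling the method through the class is equivalent to calling it on the object
same = pd.DataFrame.head(toy, 2).equals(first_two)

print(first_two)
print(last_three)
print(same)
```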

### Data dimensions

Once you have seen the data, check how many features (columns) and entities (rows) it has. For this we can use the Pandas Data Frame attribute ["shape"](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html?highlight=shape#pandas.DataFrame.shape) (very similar to the "shape" attribute of a NumPy array).

In [None]:
data.shape    # The "shape" attribute gives the total number of rows and columns as a tuple
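
Because `shape` is a plain `(rows, columns)` tuple, it can be unpacked into two variables. A minimal sketch on a made-up data frame (`toy`, `n_rows` and `n_cols` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data frame, used only for illustration
toy = pd.DataFrame({"carat": [0.23, 0.21, 0.29],
                    "cut": ["Ideal", "Premium", "Good"]})

n_rows, n_cols = toy.shape  # unpack the (rows, columns) tuple
print(f"{n_rows} rows and {n_cols} columns")
```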

### Data Structure and type

To learn more about the different data types of Python, review [Python Data Science Handbook Chapter 2](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb#scrollTo=L6xNSSrJr_ho). The table below lists the standard NumPy data types.

| Data type     | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64`` (this alias was removed in NumPy 2.0).| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128`` (this alias was removed in NumPy 2.0).| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 

In [None]:
data.dtypes

From this we can see that we have a mix of categorical (nominal) columns, stored with the 'category' data type, and numeric columns, stored as 'float64' or 'int64'.
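
To split the columns by type programmatically, one option is the Data Frame method `select_dtypes`. The sketch below runs on a made-up data frame (`toy` and its values are hypothetical); the same calls should apply to `data`.

```python
import pandas as pd

# Hypothetical toy data frame with one categorical and two numeric columns
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": pd.Categorical(["Ideal", "Premium", "Good"]),
    "price": [326, 326, 334],
})

# Column names grouped by data type
categorical_cols = toy.select_dtypes(include="category").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```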

### Data Summary

We can use the Pandas Data Frame method ["describe()"](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) to extract summary statistics about our data. By default, only the numeric columns are summarized.

In [None]:
data.describe()
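
The categorical columns can be summarized too, by passing the `include` argument. A minimal sketch on a made-up data frame (`toy` is hypothetical); for categorical columns the summary reports the count, the number of unique values, the most frequent value and its frequency.

```python
import pandas as pd

# Hypothetical toy data frame, used only for illustration
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": pd.Categorical(["Ideal", "Ideal", "Good"]),
    "price": [326, 326, 334],
})

numeric_summary = toy.describe()                          # numeric columns only
categorical_summary = toy.describe(include="category")    # categorical columns only

print(numeric_summary)
print(categorical_summary)
```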

To calculate the mode (the most frequent value) of each column, we can use the Pandas Data Frame method ["mode()"](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html):

In [None]:
data.mode()
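
Note that `mode()` returns a Data Frame rather than a single row, because a column can have several equally frequent values. The sketch below uses a made-up data frame (`toy` is hypothetical) in which column `a` has two modes; columns with fewer modes are padded with `NaN`.

```python
import pandas as pd

# Hypothetical toy data: "a" has a tie between 1 and 2, "b" has the single mode 3
toy = pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 3, 3, 4]})

modes = toy.mode()           # one row per tied mode, padded with NaN
first_mode = modes.iloc[0]   # first mode of every column

print(modes)
print(first_mode)
```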

## 2- Loading Data manually


A useful feature of Google Colab is that you can access and manipulate files from your Google Drive. When you work with Jupyter Notebook on your own computer, the equivalent is working with files stored directly on your computer.

To access the Google Drive files, we first need to "mount" the drive and specify the directory we want to work in. Run the code cell, click on the link, provide the authorization code for your Google Drive, and press ENTER.


In [None]:
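import os
from google.colab import drive

# Mount Google Drive under /content/gdrive (asks for authorization the first time)
drive.mount('/content/gdrive')

# Folder inside the drive to use as the working directory
Working_Directory = 'My Drive/DS_201' #@param {type:"string"}
wd = os.path.join('/content/gdrive', Working_Directory)
os.chdir(wd)

# Confirm that the working directory changed
dirpath = os.getcwd()
print("current directory is : " + dirpath)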
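
If the folder does not exist, `os.chdir` raises an error that can be hard to interpret. The sketch below checks the folder first; `change_to_workdir` is a hypothetical helper, and a temporary folder stands in for the mounted drive so the example can run anywhere.

```python
import os
import tempfile


def change_to_workdir(base, subfolder):
    """Hypothetical helper: move into base/subfolder, with a clear error if it is missing."""
    target = os.path.join(base, subfolder)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"Folder not found: {target}")
    os.chdir(target)
    return os.getcwd()


# A temporary folder stands in for '/content/gdrive' in this demonstration
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "My Drive", "DS_201"))

original_dir = os.getcwd()
current = change_to_workdir(base, os.path.join("My Drive", "DS_201"))
print("current directory is : " + current)

os.chdir(original_dir)  # go back to where we started
```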





In [None]:
# "!cd" runs in a temporary subshell and does not persist; os.chdir() above already set the directory
%pwd

This command shows all the files in the current directory:

In [None]:
!ls
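
On a local Jupyter installation where the `ls` shell command is not available (for example, on some Windows setups), the same listing can be produced in plain Python. A minimal sketch using only the standard library:

```python
import os

# List the entries of the current working directory, sorted by name
entries = sorted(os.listdir("."))
for name in entries:
    print(name)
```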

Now we can download files directly from a URL (for example, from GitHub).

We can use the `wget` command-line tool:

In [None]:
!wget "https://raw.githubusercontent.com/lopezbec/COVID19_Tweets_Dataset/main/Summary_Details/SUMMARY_moth.csv"

In [None]:
!ls

Or we can use the `requests` package (if you are running Jupyter Notebook on your own computer, you might not be able to run `wget` commands):

In [None]:
# Import the requests library
import requests

# URL of the file to be downloaded
URL = "https://raw.githubusercontent.com/lopezbec/COVID19_Tweets_Dataset/main/Summary_Details/SUMMARY_moth.csv"

# Send an HTTP GET request to the server and save
# the HTTP response in a response object called r
r = requests.get(URL)
r.raise_for_status()  # stop here if the download failed (e.g., 404)

# Write the contents of the response (r.content)
# to a new CSV file opened in binary mode
with open("SUMMARY_moth_v2.csv", 'wb') as f:
    f.write(r.content)
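
If the `requests` package is not installed, the Python standard library offers a third option. The sketch below is one possible approach, not the method used in this notebook; `download_file` is a hypothetical helper name.

```python
import shutil
import urllib.request


def download_file(url, destination):
    """Hypothetical helper: copy the content found at `url` into the file `destination`."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as f:
        shutil.copyfileobj(response, f)  # stream the response to disk
    return destination
```

Calling `download_file(URL, "my_copy.csv")` with the `URL` defined above would save the file under the illustrative name `my_copy.csv`.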

In [None]:
!ls

Now that the file is in the working directory, we can read it using the Pandas "read_csv" function:

In [None]:
import pandas as pd

data = pd.read_csv('SUMMARY_moth_v2.csv')

data.head()
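
The data-understanding steps from Section 1 apply to any Data Frame loaded this way. The sketch below repeats them on made-up CSV text read from memory; the column names `month` and `tweets` are hypothetical and may not match the real file.

```python
import io

import pandas as pd

# Hypothetical CSV content; the columns of the real file may differ
csv_text = "month,tweets\n2020-01,100\n2020-02,250\n2020-03,175\n"

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)       # number of rows and columns
print(toy.dtypes)      # data type of each column
print(toy.describe())  # summary statistics of the numeric columns
```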