<center><b>© 2021. Content is made available under the CC-BY-NC-ND 4.0 license. Christian Lopez, lopezbec@lafayette.edu</b></center>

# Data Understanding: Python Colab Introduction

## 1- Loading dataset from a package


Along with NumPy, [Pandas](https://pandas.pydata.org/) is one of the most widely used Python packages for manipulating data. This is not an introduction to Pandas; if you would like to learn more, see [Python Data Science Handbook Chapter 3](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb).

We will be using the "diamonds" dataset from the [seaborn](https://seaborn.pydata.org/) package.




In [1]:
import seaborn as sns
import pandas as pd

data = sns.load_dataset('diamonds')


This data is already a Pandas Data Frame object.

In [2]:
type(data)

In [3]:
data.head()  # The "head" method returns the first 5 rows of the data frame

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [4]:
data.tail() #The "tail" method prints the last 5 rows of the data frame

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.5
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.7,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
53939,0.75,Ideal,D,SI2,62.2,55.0,2757,5.83,5.87,3.64


### Data dimensions

Once you have seen the data, check how many features (columns) and how many entities (rows) it has. For this we can use the Pandas Data Frame attribute ["shape"](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html?highlight=shape#pandas.DataFrame.shape) (this is very similar to the `shape` attribute of NumPy arrays).

In [5]:
data.shape    # The "shape" attribute shows the total number of rows and columns

(53940, 10)
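
Because `shape` is a plain tuple, it can be unpacked into separate row and column counts. The sketch below uses a small made-up data frame (`toy`, an assumption for illustration only) in place of the diamonds data:

```python
import pandas as pd

# Hypothetical toy data frame standing in for the diamonds data
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 327],
})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = toy.shape
print(f"{n_rows} entities (rows), {n_cols} features (columns)")

# len() of a data frame gives the number of rows only
n_rows_from_len = len(toy)
```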

### Data Structure and type

To learn more about the different data types in Python, review [Python Data Science Handbook Chapter 2](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb#scrollTo=L6xNSSrJr_ho). The table below lists the standard NumPy data types.

| Data type	    | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)|
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)|
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)|
| ``int8``      | Byte (-128 to 127)|
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)|
| ``uint8``     | Unsigned integer (0 to 255)|
| ``uint16``    | Unsigned integer (0 to 65535)|
| ``uint32``    | Unsigned integer (0 to 4294967295)|
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)|
| ``float_``    | Shorthand for ``float64``.|
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa|
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa|
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa|
| ``complex_``  | Shorthand for ``complex128``.|
| ``complex64`` | Complex number, represented by two 32-bit floats|
| ``complex128``| Complex number, represented by two 64-bit floats|

In [6]:
data.dtypes

Unnamed: 0,0
carat,float64
cut,category
color,category
clarity,category
depth,float64
table,float64
price,int64
x,float64
y,float64
z,float64


From this we can see that we have a mix of categorical data types (i.e., 'category') and numeric data types (i.e., 'float64' and 'int64').
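
When the two kinds of columns need different treatment, one option is to split them with the Pandas `select_dtypes` method. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions, not the diamonds data):

```python
import pandas as pd

# Hypothetical toy data frame with the same kinds of columns as the diamonds data
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": pd.Categorical(["Ideal", "Premium", "Premium"]),
    "price": [326, 326, 334],
})

# Keep only the numeric columns (float64, int64, ...)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Keep only the categorical columns
category_cols = toy.select_dtypes(include="category").columns.tolist()

print(numeric_cols)
print(category_cols)
```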

### Data Summary

We can use the Pandas Data Frame method ["describe()"](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) to extract some summary statistics about our data. By default it summarizes only the numeric columns.

In [7]:
data.describe()

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8
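
The categorical columns (cut, color, clarity) are absent from the summary above. One way to summarize them is the `include` parameter of `describe()`. The sketch below uses a made-up data frame (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data frame with one numeric and one categorical column
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 0.31],
    "cut": pd.Categorical(["Ideal", "Premium", "Premium", "Good"]),
})

# Default behaviour: numeric columns only
numeric_summary = toy.describe()

# Categorical columns: count, unique, top (most frequent value), freq
category_summary = toy.describe(include=["category"])

print(category_summary)
```

Passing `include="all"` instead would put both kinds of columns in one table, with NaN wherever a statistic does not apply.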


To calculate the mode (the most frequent value of each column), we can use the Pandas Data Frame method ["mode()"](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html).

In [8]:
data.mode()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.3,Ideal,G,SI1,62.0,56.0,605,4.37,4.34,2.7
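
The output of `mode()` can have more than one row: when several values are tied for most frequent, all of them are returned. A minimal sketch with made-up prices (the `prices` series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical prices where 326 and 334 are tied for most frequent
prices = pd.Series([326, 326, 334, 334, 335])

# mode() returns every tied value, sorted, not just one of them
modes = prices.mode()
print(modes.tolist())

# value_counts() shows how often each value occurs
counts = prices.value_counts()
print(counts)
```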


## 2- Loading Data manually


A useful feature of Google Colab is that you can access and manipulate files from your Google Drive. When you work with Jupyter Notebook on your own computer, the equivalent is working directly with local files.

To access the Google Drive files, we first need to “mount” your Google Drive and provide the “directory” you want to work in. Run the code cell, click on the link, provide the code for your Google Drive, and press ENTER.


In [9]:
import os
from google.colab import drive

drive.mount('/content/gdrive')

Working_Directory = 'My Drive/DS_201' #@param {type:"string"}
wd="/content/gdrive/"+Working_Directory
os.chdir(wd)


dirpath = os.getcwd()
print("current directory is : " + dirpath)





Mounted at /content/gdrive
current directory is : /content/gdrive/My Drive/DS_201
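
If the folder name is misspelled, `os.chdir` fails with an error. A possible safeguard is to check that the folder exists before changing into it. In the sketch below, `resolve_working_directory` is a hypothetical helper, and a temporary folder stands in for `/content/gdrive` so the example runs anywhere:

```python
import os
import tempfile


def resolve_working_directory(base, sub_dir):
    """Join base and sub_dir and check that the folder exists (hypothetical helper)."""
    path = os.path.join(base, sub_dir)
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Folder not found: {path}")
    return path


# Demo: a temporary folder stands in for '/content/gdrive'
with tempfile.TemporaryDirectory() as fake_drive:
    os.makedirs(os.path.join(fake_drive, "My Drive", "DS_201"))
    wd = resolve_working_directory(fake_drive, "My Drive/DS_201")
    print("working directory would be : " + wd)
```

In Colab, the returned path could then be passed to `os.chdir`.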


To show all the files in the current directory, you can run the `ls` shell command (`!ls` in a Colab cell).
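
The files in a directory can also be listed from Python with the standard `os` module. A minimal sketch, using a temporary folder and made-up file names (both are assumptions for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create two empty placeholder files (hypothetical names)
    for name in ["b.csv", "a.csv"]:
        open(os.path.join(folder, name), "w").close()

    # os.listdir returns names in arbitrary order, so sort them
    files = sorted(os.listdir(folder))
    print(files)
```

Calling `os.listdir()` with no argument lists the current working directory.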

Now, we can directly download files from a URL (like GitHub).

We can use the 'wget' command-line tool:

In [11]:
!wget "https://raw.githubusercontent.com/lopezbec/COVID19_Tweets_Dataset/main/Summary_Details/SUMMARY_moth.csv"

--2025-01-30 00:45:01--  https://raw.githubusercontent.com/lopezbec/COVID19_Tweets_Dataset/main/Summary_Details/SUMMARY_moth.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3306 (3.2K) [text/plain]
Saving to: ‘SUMMARY_moth.csv’


2025-01-30 00:45:02 (897 KB/s) - ‘SUMMARY_moth.csv’ saved [3306/3306]



Or we can use the requests package (if you are using Jupyter notebooks on your computer, you might not be able to run the "wget" command):

In [13]:
# Import the requests library
import requests

# URL of the file to be downloaded
URL = "https://raw.githubusercontent.com/lopezbec/COVID19_Tweets_Dataset/main/Summary_Details/SUMMARY_moth.csv"

# Send an HTTP GET request to the server and save
# the HTTP response in a response object called r
r = requests.get(URL)
r.raise_for_status()  # stop here if the download failed (e.g., 404)

# Write the contents of the response (r.content)
# to a new CSV file opened in binary mode
with open("SUMMARY_moth_v2.csv", 'wb') as f:
    f.write(r.content)
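
If neither `wget` nor the requests package is available, a third option (not used in this notebook) is the standard library's `urllib.request`. The sketch below is an assumption-laden illustration: `download_file` is a hypothetical helper, and a local `file://` URL stands in for the GitHub URL so the example runs offline:

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Stream the content at url into a local file (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as f:
        shutil.copyfileobj(response, f)


# Offline demo: a local file:// URL stands in for the GitHub URL
with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "source.csv"
    source.write_text("Year,Month\n2020,1\n")
    target = Path(folder) / "copy.csv"
    download_file(source.as_uri(), target)
    copied_text = target.read_text()
    print(copied_text)
```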

Now that the file is in the working directory, we can read it using the Pandas "read_csv" function:

In [15]:
import pandas as pd

data = pd.read_csv('SUMMARY_moth_v2.csv')

data.head()

Unnamed: 0.1,Unnamed: 0,Year,Month,Avg #OR,Avg #RT,Avg Tweets,# OR,# RT,#Total,Total Geo,Max Rt,MD RT,Max Like,MD Like
0,1,2020,1,5947.0,30576.5,35501.5,1958346,7852504,9810850,1773,674151,166.5,334802,0
1,2,2020,2,10978.0,29918.0,40604.5,7624648,21944443,29568948,8103,469739,50.0,637589,0
2,3,2020,3,13095.5,44714.5,56283.0,12610824,46659589,59270412,19952,1064693,159.0,1255858,0
3,4,2020,4,30091.0,89513.0,119859.5,20594379,60311559,80905936,38220,649823,36.0,662005,0
4,5,2020,5,35163.0,100022.5,135709.0,26307406,73792461,100099863,47777,1007616,27.0,929811,0
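
An `Unnamed: 0` column like the one above usually appears when a CSV file was saved together with its row index. Assuming that is the case, the `index_col` parameter of `read_csv` can turn that column back into the index. The sketch below uses made-up CSV text (`csv_text` is an assumption, not the downloaded file):

```python
import io

import pandas as pd

# Hypothetical CSV text saved with its index, so the first header cell is empty
csv_text = ",Year,Month\n1,2020,1\n2,2020,2\n"

# Default: the unnamed first column is kept as a regular column
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# index_col=0 uses the first column as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```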




In [17]:
!pwd

/content/gdrive/MyDrive/DS_201
