# MACS Project
This notebook is for evaluating your chosen datasets. Feel free to experiment; there are also good tutorials available on the web.
***
Let us start by importing the Python libraries we use. If you need additional libraries, add them here and re-run the cell; keeping all imports in one place gives a better overview.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

If no error message appears, all libraries were found and are ready to use. Otherwise, you may have to install them first.
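
One way to check this up front is to ask Python whether each library can be located. The sketch below uses the standard-library function `importlib.util.find_spec`; the list of names is only an example and should be adapted to whatever you import.

```python
import importlib.util

# Libraries this notebook relies on (extend the list as needed)
required = ["pandas", "matplotlib"]

# find_spec returns None when a top-level package cannot be located
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All libraries found.")
```

Missing packages can usually be installed from a terminal with `pip install <name>`.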
***
Next, we import our dataset. pandas offers different readers depending on the file type (see https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

In [2]:
# Read the file "data.csv" (it must be in the same folder as the notebook;
# otherwise, add the corresponding path) and assign it to
# the variable named "car"
car = pd.read_csv("data.csv")
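
The other readers from the I/O guide follow the same pattern. As a small sketch that runs without any files on disk, the example below parses a semicolon-separated table and a JSON document from in-memory strings; the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample data, held in memory instead of in files
csv_text = "name;score\nanna;3\nben;5\n"
json_text = '[{"name": "anna", "score": 3}, {"name": "ben", "score": 5}]'

# read_csv accepts a custom separator via `sep`
from_csv = pd.read_csv(io.StringIO(csv_text), sep=";")

# read_json parses a list of records into rows
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

For real files, pass the file path instead of the `io.StringIO` object.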

After importing the data, we can have a look at it and compute some summary statistics.

In [3]:
# Display the first three rows of the dataset
car.head(3)

Unnamed: 0.1,Unnamed: 0,ISO,EVENT_DATE,EVENT_TYPE,SUB_EVENT_TYPE,ACTOR1,ASSOC_ACTOR_1,ACTOR2,ASSOC_ACTOR_2,INTERACTION,REGION,COUNTRY,ADMIN1,ADMIN2,LOCATION,SOURCE,NOTES,FATALITIES,TIMESTAMP
0,0,12,2010-12-20,Battles,Armed clash,Military Forces of Algeria (1999-),,Unidentified Armed Group (Algeria),,13,Northern Africa,Algeria,Boumerdes,Ammal,Ait Dahmane,TSA Algerie,A militant was captured by security forces on ...,0,1563903165
1,1,12,2010-12-25,Riots,Violent demonstration,Police Forces of Algeria (1999-),,Rioters (Algeria),,15,Northern Africa,Algeria,Alger,Sidi M'Hamed,Algiers,Liberte (Algeria),Riots broke out in districts covered by the pr...,0,1579554013
2,2,12,2010-12-26,Battles,Armed clash,AQIM: Al Qaeda in the Islamic Maghreb,,Military Forces of Algeria (1999-),,12,Northern Africa,Algeria,Jijel,Jijel,Jijel,Xinhua,Two AQLMI militants are killed and five wounde...,2,1572403789


In [13]:
# Display the last three rows of the dataset
car.tail(3)

Unnamed: 0,Review,Rating
20488,"ok just looks nice modern outside, desk staff ...",2
20489,hotel theft ruined vacation hotel opened sept ...,1
20490,"people talking, ca n't believe excellent ratin...",2


In [14]:
# Show various information, e.g. number of rows and columns,
# how many non-null values each column has, and the data types
car.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20491 entries, 0 to 20490
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  20491 non-null  object
 1   Rating  20491 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 320.3+ KB


In [15]:
# We can calculate mean, standard deviation, min/max
# on numerical attributes
car.describe()

Unnamed: 0,Rating
count,20491.0
mean,3.952223
std,1.23303
min,1.0
25%,3.0
50%,4.0
75%,5.0
max,5.0


In [19]:
# Check every value for null (Boolean) and count the
# null values per column
car.isnull().sum()

Review    0
Rating    0
dtype: int64

In [17]:
# Selecting a single column with square brackets (the name must match
# an existing column; otherwise pandas raises a KeyError)
car['Market Category']

KeyError: 'Market Category'
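
A `KeyError` like the one above means that the requested column name is not present in the loaded DataFrame. Here is a minimal sketch for checking this first, using a made-up two-column frame that mirrors the `Review`/`Rating` layout shown earlier (the values are invented):

```python
import pandas as pd

# Made-up stand-in for the loaded dataset
df = pd.DataFrame({"Review": ["nice hotel", "bad stay"], "Rating": [4, 1]})

# List the column names that can actually be selected
print(list(df.columns))

wanted = "Market Category"
if wanted in df.columns:
    print(df[wanted])
else:
    print(f"Column {wanted!r} not found")

# Selecting an existing column returns a pandas Series
ratings = df["Rating"]
print(ratings.tolist())
```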

In [None]:
# Selecting data matching a predicate (here: attribute with
# missing values)
car[car['Market Category'].isnull()]

In [None]:
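# Replace NaN (null) values with e.g. the string 'unknown'.
# Assigning the result back is more reliable than inplace=True on a
# selected column, which newer pandas versions warn about.
car['Market Category'] = car['Market Category'].fillna('unknown')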

In [None]:
car['Market Category'].isnull().sum()
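
The three cells above can be tried on a small made-up DataFrame; the column name `category` and its values are hypothetical.

```python
import pandas as pd

# Made-up data with one missing value in "category"
df = pd.DataFrame({"model": ["a", "b", "c"],
                   "category": ["luxury", None, "compact"]})

# Boolean mask: True where "category" is missing
mask = df["category"].isnull()
print(df[mask])  # rows with a missing category

# Assign the filled column back instead of modifying it in place
df["category"] = df["category"].fillna("unknown")

# Number of missing values left in the column
print(df["category"].isnull().sum())
```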

In [None]:
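# Calculate the correlation coefficient between numerical attributes.
# numeric_only=True (available in pandas 1.5 and newer) skips text
# columns, which would otherwise raise an error in pandas 2.x.
car.corr(numeric_only=True)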
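
By default, `corr` computes the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to 1 (perfect positive one). A small sketch with invented numbers:

```python
import pandas as pd

# Invented values: y rises with x, z falls with x
df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],
                   "z": [8, 6, 4, 2],
                   "label": ["a", "b", "c", "d"]})

# numeric_only=True leaves the text column "label" out
# (assumes pandas 1.5 or newer, where this parameter exists)
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

The result is a square table with one row and one column per numerical attribute; the diagonal is always 1 because every attribute correlates perfectly with itself.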

If you need more information about these commands, have a look at the documentation:
* info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
* describe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
* isnull: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html
* fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
* corr: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
***
Next, we can also use plots to visualize attributes (Matplotlib, https://matplotlib.org/api/pyplot_api.html).

In [None]:
# Histogram; good to see distribution of values
plt.hist(car['Year'])

In [None]:
# It also works on categorical attributes
plt.hist(car['Vehicle Size'])

In [None]:
# Pie chart; the first argument counts the occurrences of each unique
# value, the second provides the labels. The labels are taken from the
# index of the counts so that both are in the same order.
counts = car['Vehicle Size'].value_counts()
plt.pie(counts, labels=counts.index)
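
When labeling a pie chart, the labels have to be in the same order as the counts. `value_counts()` sorts by frequency, while `unique()` returns values in order of first appearance, so the two can disagree. Here is a sketch with made-up sizes, taking the labels from the index of the counts:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up categorical column
sizes = pd.Series(["Compact", "Large", "Large", "Midsize", "Large", "Midsize"])

counts = sizes.value_counts()   # sorted by frequency, most common first
print(counts.index.tolist())    # label order matching the counts
print(sizes.unique().tolist())  # order of first appearance

fig, ax = plt.subplots()
ax.pie(counts, labels=counts.index)
ax.set_title("Share per size")
```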

In [18]:
# Simple line plot; not that useful for the car dataset
# (better if your data is ordered by some criterion)
plt.plot(car['MSRP'])

KeyError: 'MSRP'
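
A line plot is most informative when the rows follow a meaningful order. As a sketch, assuming a made-up table with a `year` and a `price` column (both names are hypothetical), sorting by `year` before plotting gives a readable trend:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up, unordered data
df = pd.DataFrame({"year": [2012, 2010, 2013, 2011],
                   "price": [30, 10, 40, 20]})

# Order the rows by the criterion that goes on the x-axis
ordered = df.sort_values("year")

fig, ax = plt.subplots()
ax.plot(ordered["year"], ordered["price"])
ax.set_xlabel("year")
ax.set_ylabel("price")
```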

There are many more plot types and customization options. If you are interested, feel free to have a look at them:
* https://matplotlib.org/3.3.2/tutorials/introductory/sample_plots.html
* https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html
* https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.pie.html
* https://matplotlib.org/tutorials/introductory/pyplot.html