# Data Exploration

## Instructions
There are some data files related to transactions saved under the [data](../data) folder:
- Look into the data using appropriate functions and extract the fields in the data.
- For each data file, describe what the data is about and what fields it contains.

You need to answer the questions and perform the task below:
- How many transactions are in GBP?
- How many transactions are NOT in USD?
- What is the average and median transaction in USD?
- Construct a table showing the number of transactions in EACH currency

Note:
- You are NOT ALLOWED to import any other libraries or packages
- You can write your own functions
- Your answers should be readable, with appropriate comments
- You can refer to the [markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) if you are not familiar with Markdown

## Import libraries 

In [3]:
# Usual libraries are imported here
import os
import yaml
import dask.dataframe as dd
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Please perform your tasks below and answer the questions

## countries

'countries.csv' holds information on 226 countries in 5 columns: CODE, NAME, CODE3, NUMCODE, and PHONECODE. The CODE column has one missing value (225 non-null out of 226).

In [8]:
countries = pd.read_csv('../data/countries.csv')
countries.info()
countries.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 5 columns):
CODE         225 non-null object
NAME         226 non-null object
CODE3        226 non-null object
NUMCODE      226 non-null int64
PHONECODE    226 non-null int64
dtypes: int64(2), object(3)
memory usage: 8.9+ KB


| | CODE | NAME | CODE3 | NUMCODE | PHONECODE |
|---|---|---|---|---|---|
| 0 | AF | Afghanistan | AFG | 4 | 93 |
| 1 | AL | Albania | ALB | 8 | 355 |
| 2 | DZ | Algeria | DZA | 12 | 213 |
| 3 | AS | American Samoa | ASM | 16 | 1684 |
| 4 | AO | Angola | AGO | 24 | 244 |
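
The `info()` output above reports 225 non-null values in CODE for 226 rows. One plausible explanation, not checked against the file itself, is that a two-letter code such as `NA` (Namibia) was read as a missing value, since `pd.read_csv` treats the string `NA` as NaN by default. A minimal sketch on made-up two-row data:

```python
import io
import pandas as pd

# Made-up CSV text for illustration only; this is NOT the real countries.csv
csv_text = "CODE,NAME\nNA,Namibia\nAF,Afghanistan\n"

# Default parsing: the string 'NA' is interpreted as a missing value
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps 'NA' as an ordinary string
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default['CODE'].isna().sum(), 'missing CODE values with default parsing')
print(literal['CODE'].isna().sum(), 'missing CODE values with keep_default_na=False')
```

If this turns out to be the cause, passing `keep_default_na=False` when loading `countries.csv` would keep the code as text.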


## currency_details

'currency_details.csv' lists 184 currencies, each with its exponent (EXPONENT) and a flag for whether it is a cryptocurrency (IS_CRYPTO).

In [4]:
currency_details = pd.read_csv('../data/currency_details.csv')
currency_details.info()
currency_details.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184 entries, 0 to 183
Data columns (total 3 columns):
CCY          184 non-null object
EXPONENT     184 non-null int64
IS_CRYPTO    184 non-null bool
dtypes: bool(1), int64(1), object(1)
memory usage: 3.1+ KB


| | CCY | EXPONENT | IS_CRYPTO |
|---|---|---|---|
| 0 | AED | 2 | False |
| 1 | AFN | 2 | False |
| 2 | ALL | 2 | False |
| 3 | AMD | 2 | False |
| 4 | ANG | 2 | False |
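
As a sketch of how EXPONENT could be used: assuming it gives the number of decimal places of the currency's minor unit, and assuming the integer AMOUNT values in the transaction data are stored in minor units (neither is confirmed by the data files shown here), an amount could be converted to major units as below. The helper name `to_major_units` is hypothetical.

```python
from decimal import Decimal


def to_major_units(amount_minor: int, exponent: int) -> Decimal:
    """Convert an integer amount in minor units to major units.

    Assumption: `exponent` is the number of decimal places of the currency,
    e.g. 2 for a currency whose minor unit is one hundredth of the major unit.
    """
    return Decimal(amount_minor) / (Decimal(10) ** exponent)


# Under the assumptions above, an AMOUNT of 175 with EXPONENT 2 is 1.75
print(to_major_units(175, 2))
# An exponent of 0 leaves the amount unchanged
print(to_major_units(500, 0))
```

`Decimal` is used instead of `float` so that the conversion does not introduce rounding error.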


## fraudsters

'fraudsters.csv' lists the user IDs of 298 fraudsters.

In [15]:
fraudsters = pd.read_csv('../data/fraudsters.csv')
fraudsters.info()
fraudsters.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 1 columns):
USER_ID    298 non-null object
dtypes: object(1)
memory usage: 2.4+ KB


| | USER_ID |
|---|---|
| 0 | 5270b0f4-2e4a-4ec9-8648-2135312ac1c4 |
| 1 | 848fc1b1-096c-40f7-b04a-1399c469e421 |
| 2 | 27c76eda-e159-4df3-845a-e13f4e28a8b5 |
| 3 | a27088ef-9452-403d-9bbb-f7b10180cdda |
| 4 | fb23710b-609a-49bf-8a9a-be49c59ce6de |
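
Since `users.csv` carries its own IS_FRAUDSTER flag, one possible consistency check is whether the IDs listed here are exactly the flagged users. A sketch on made-up IDs (`u1` to `u3` are placeholders, not real data):

```python
import pandas as pd

# Placeholder frames standing in for users.csv and fraudsters.csv
users_sample = pd.DataFrame({
    'ID': ['u1', 'u2', 'u3'],
    'IS_FRAUDSTER': [True, False, False],
})
fraudsters_sample = pd.DataFrame({'USER_ID': ['u1', 'u3']})

# True where the user's ID appears in the fraudster list
listed = users_sample['ID'].isin(fraudsters_sample['USER_ID'])

# Rows where the list and the flag disagree
mismatches = users_sample[listed != users_sample['IS_FRAUDSTER']]
print(mismatches)
```

An empty `mismatches` frame on the real data would suggest the two sources agree; here `u3` is deliberately inconsistent to show what a disagreement looks like.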


## users

'users.csv' describes 9944 users in 11 columns: ID, whether they have an email (HAS_EMAIL), phone country, whether they are a fraudster, the terms version, creation date, state, country, birth year, KYC status, and the number of failed sign-in attempts. TERMS_VERSION is missing for some users (8417 non-null out of 9944).

In [16]:
users = pd.read_csv('../data/users.csv')
users.info()
users.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9944 entries, 0 to 9943
Data columns (total 11 columns):
ID                         9944 non-null object
HAS_EMAIL                  9944 non-null int64
PHONE_COUNTRY              9944 non-null object
IS_FRAUDSTER               9944 non-null bool
TERMS_VERSION              8417 non-null object
CREATED_DATE               9944 non-null object
STATE                      9944 non-null object
COUNTRY                    9944 non-null object
BIRTH_YEAR                 9944 non-null int64
KYC                        9944 non-null object
FAILED_SIGN_IN_ATTEMPTS    9944 non-null int64
dtypes: bool(1), int64(3), object(7)
memory usage: 786.7+ KB


| | ID | HAS_EMAIL | PHONE_COUNTRY | IS_FRAUDSTER | TERMS_VERSION | CREATED_DATE | STATE | COUNTRY | BIRTH_YEAR | KYC | FAILED_SIGN_IN_ATTEMPTS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872820f-e3ac-4c02-bdc7-727897b60043 | 1 | GB\|\|JE\|\|IM\|\|GG | False | 2018-05-25 | 2017-08-06 07:33:33.341000 | ACTIVE | GB | 1971 | PASSED | 0 |
| 1 | 545ff94d-66f8-4bea-b398-84425fb2301e | 1 | GB\|\|JE\|\|IM\|\|GG | False | 2018-01-01 | 2017-03-07 10:18:59.427000 | ACTIVE | GB | 1982 | PASSED | 0 |
| 2 | 10376f1a-a28a-4885-8daa-c8ca496026bb | 1 | ES | False | 2018-09-20 | 2018-05-31 04:41:24.672000 | ACTIVE | ES | 1973 | PASSED | 0 |
| 3 | fd308db7-0753-4377-879f-6ecf2af14e4f | 1 | FR | False | 2018-05-25 | 2018-06-01 17:24:23.852000 | ACTIVE | FR | 1986 | PASSED | 0 |
| 4 | 755fe256-a34d-4853-b7ca-d9bb991a86d3 | 1 | GB\|\|JE\|\|IM\|\|GG | False | 2018-09-20 | 2017-08-09 15:03:33.945000 | ACTIVE | GB | 1989 | PASSED | 0 |
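
Two columns in the preview need some parsing before analysis: PHONE_COUNTRY can hold several codes joined by `||`, and CREATED_DATE is stored as text. A small standard-library sketch using values from the first preview row; the format string is an assumption based on the five rows shown:

```python
from datetime import datetime

# Values copied from the first row of the users preview
phone_country = 'GB||JE||IM||GG'
created_date = '2017-08-06 07:33:33.341000'

# Split the combined field into individual country codes
codes = phone_country.split('||')

# Parse the timestamp text; %f accepts the fractional seconds
created = datetime.strptime(created_date, '%Y-%m-%d %H:%M:%S.%f')

print(codes)
print(created.year, created.month, created.day)
```

On the full DataFrame, the same ideas would correspond to `users['PHONE_COUNTRY'].str.split('||', regex=False)` and `pd.to_datetime(users['CREATED_DATE'])`.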


## transactions

'transactions.csv' records 688651 transactions in 12 columns: CURRENCY, AMOUNT, STATE, CREATED_DATE, MERCHANT_CATEGORY, MERCHANT_COUNTRY, ENTRY_METHOD, USER_ID, TYPE, SOURCE, ID, and AMOUNT_USD. MERCHANT_CATEGORY and MERCHANT_COUNTRY contain missing values.

In [8]:
transaction = pd.read_csv('../data/transactions.csv')
transaction.info()
transaction.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 688651 entries, 0 to 688650
Data columns (total 12 columns):
CURRENCY             688651 non-null object
AMOUNT               688651 non-null int64
STATE                688651 non-null object
CREATED_DATE         688651 non-null object
MERCHANT_CATEGORY    223065 non-null object
MERCHANT_COUNTRY     483055 non-null object
ENTRY_METHOD         688651 non-null object
USER_ID              688651 non-null object
TYPE                 688651 non-null object
SOURCE               688651 non-null object
ID                   688651 non-null object
AMOUNT_USD           688651 non-null int64
dtypes: int64(2), object(10)
memory usage: 63.0+ MB


| | CURRENCY | AMOUNT | STATE | CREATED_DATE | MERCHANT_CATEGORY | MERCHANT_COUNTRY | ENTRY_METHOD | USER_ID | TYPE | SOURCE | ID | AMOUNT_USD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GBP | 175 | COMPLETED | 2017-12-20 12:46:20.294 | cafe | GBR | cont | 8f99c254-7cf2-4e35-b7e4-53804d42445d | CARD_PAYMENT | GAIA | b3332e6f-7865-4d6e-b6a5-370bc75568d8 | 220 |
| 1 | EUR | 2593 | COMPLETED | 2017-12-20 12:38:47.232 | bar | AUS | cont | ed773c34-2b83-4f70-a691-6a7aa1cb9f11 | CARD_PAYMENT | GAIA | 853d9ff8-a007-40ef-91a2-7d81e29a309a | 2885 |
| 2 | EUR | 1077 | COMPLETED | 2017-12-20 12:34:39.668 | NaN | CZE | cont | eb349cc1-e986-4bf4-bb75-72280a7b8680 | CARD_PAYMENT | GAIA | 04de8238-7828-4e46-91f1-050a9aa7a9df | 1198 |
| 3 | GBP | 198 | COMPLETED | 2017-12-20 12:45:50.555 | supermarket | GBR | cont | dc78fbc4-c936-45d3-a813-e2477ac6d74b | CARD_PAYMENT | GAIA | 2b790b9b-c312-4098-a4b3-4830fc8cda53 | 249 |
| 4 | EUR | 990 | COMPLETED | 2017-12-20 12:45:32.722 | NaN | FRA | cont | 32958a5c-2532-42f7-94f9-127f2a812a55 | CARD_PAYMENT | GAIA | 6469fc3a-e535-41e9-91b9-acb46d1cc65d | 1101 |
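
The `info()` output shows that MERCHANT_CATEGORY and MERCHANT_COUNTRY are the only columns with fewer non-null values than rows. The share of missing values follows directly from the counts reported above; the arithmetic can be sketched as:

```python
# Counts copied from the transaction.info() output
total_rows = 688651
non_null = {
    'MERCHANT_CATEGORY': 223065,
    'MERCHANT_COUNTRY': 483055,
}

# Fraction of rows where each column is missing
missing_share = {col: 1 - n / total_rows for col, n in non_null.items()}

for col, share in missing_share.items():
    print(f'{col}: {share:.1%} missing')
```

Roughly two thirds of the rows have no merchant category and about 30% have no merchant country, which is worth keeping in mind before grouping by either column.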


- How many transactions are in GBP?
- How many transactions are NOT in USD?
- What is the average and median transaction in USD?
- Construct a table showing the number of transactions in EACH currency

In [70]:
# Count rows whose currency is GBP
print((transaction['CURRENCY'] == 'GBP').sum(), 'transactions in GBP')
# Count rows whose currency is anything other than USD
print((transaction['CURRENCY'] != 'USD').sum(), 'transactions NOT in USD')
# Mean and median of AMOUNT over the transactions made in USD
usd_amounts = transaction.loc[transaction['CURRENCY'] == 'USD', 'AMOUNT']
print('The average and median transactions in USD are', np.mean(usd_amounts), ',', np.median(usd_amounts))

339091 transactions in GBP
657109 transactions NOT in USD
The average and median transactions in USD are 11598.75470800837 , 2000.0
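
The mean and median above are taken over AMOUNT for transactions whose CURRENCY is USD. Another reading of the question is the USD value of all transactions, which the AMOUNT_USD column would give. Which reading is intended is not stated, so the following is only a sketch of the alternative, run on the five preview rows shown earlier rather than on the full file:

```python
import pandas as pd

# The five preview rows of transactions.csv (only the columns needed here)
sample = pd.DataFrame({
    'CURRENCY': ['GBP', 'EUR', 'EUR', 'GBP', 'EUR'],
    'AMOUNT': [175, 2593, 1077, 198, 990],
    'AMOUNT_USD': [220, 2885, 1198, 249, 1101],
})

# Boolean masks sum to the number of matching rows
gbp_count = int(sample['CURRENCY'].eq('GBP').sum())
non_usd_count = int(sample['CURRENCY'].ne('USD').sum())

# Alternative reading: statistics of the USD-converted amount of every row
mean_usd = sample['AMOUNT_USD'].mean()
median_usd = sample['AMOUNT_USD'].median()

print(gbp_count, non_usd_count, mean_usd, median_usd)
```

These figures describe the five-row sample only and say nothing about the full data set.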


In [80]:
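# Count how often each currency code occurs (np.unique returns the codes sorted)
unique, counts = np.unique(transaction['CURRENCY'], return_counts=True)
# Build the currency -> number of transactions table
pd.Series(dict(zip(unique, counts)))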

AED       847
AUD      2110
BTC       283
CAD      1463
CHF      5761
CZK      1507
DKK      1711
ETH       197
EUR    264695
GBP    339091
HKD       480
HUF      1446
ILS       522
INR       207
JPY       733
LTC       137
MAD       115
NOK      2602
NZD       717
PLN     22362
QAR        28
RON      5837
SEK      1579
SGD       487
THB       690
TRY       338
USD     31542
XRP        38
ZAR      1126
dtype: int64
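
The same kind of table can also be produced with pandas' own `value_counts`, and the counts above can be cross-checked against the earlier answers. A sketch, where `sample` is a made-up stand-in for `transaction['CURRENCY']` and the column name `N_TRANSACTIONS` is an arbitrary choice:

```python
import pandas as pd

# Made-up currency column for illustration
sample = pd.Series(['GBP', 'EUR', 'EUR', 'GBP', 'EUR', 'USD'], name='CURRENCY')

# Count per currency, sorted by currency code, as a two-column table
table = (
    sample.value_counts()
    .sort_index()
    .rename_axis('CURRENCY')
    .reset_index(name='N_TRANSACTIONS')
)
print(table)

# Cross-check of the counts reported in the table above
reported = {
    'AED': 847, 'AUD': 2110, 'BTC': 283, 'CAD': 1463, 'CHF': 5761,
    'CZK': 1507, 'DKK': 1711, 'ETH': 197, 'EUR': 264695, 'GBP': 339091,
    'HKD': 480, 'HUF': 1446, 'ILS': 522, 'INR': 207, 'JPY': 733,
    'LTC': 137, 'MAD': 115, 'NOK': 2602, 'NZD': 717, 'PLN': 22362,
    'QAR': 28, 'RON': 5837, 'SEK': 1579, 'SGD': 487, 'THB': 690,
    'TRY': 338, 'USD': 31542, 'XRP': 38, 'ZAR': 1126,
}
total = sum(reported.values())
print(total, 'transactions in total')
print(total - reported['USD'], 'transactions NOT in USD')
```

The per-currency counts add up to the 688651 rows reported by `info()`, and subtracting the USD count reproduces the 657109 non-USD transactions found earlier.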