# Ex2 - Filtering and Sorting Data

This time the data comes from the internet: the CSV file linked in Step 2 is read into Spark from a local copy.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [2]:
spark = SparkSession.builder \
    .appName('Based on Pandas Exercises - 1') \
    .getOrCreate()

spark


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv). 

### Step 3. Assign it to a variable called euro12.

In [11]:
euro12 = spark.read.csv(
    './Euro_2012_stats_TEAM.csv',
    header=True,
    inferSchema=True
)

euro12

DataFrame[Team: string, Goals: int, Shots on target: int, Shots off target: int, Shooting Accuracy: string, % Goals-to-shots: string, Total shots (inc. Blocked): int, Hit Woodwork: int, Penalty goals: int, Penalties not scored: int, Headed goals: int, Passes: int, Passes completed: int, Passing Accuracy: string, Touches: int, Crosses: int, Dribbles: int, Corners Taken: int, Tackles: int, Clearances: int, Interceptions: int, Clearances off line: int, Clean Sheets: int, Blocks: int, Goals conceded: int, Saves made: int, Saves-to-shots ratio: string, Fouls Won: int, Fouls Conceded: int, Offsides: int, Yellow Cards: int, Red Cards: int, Subs on: int, Subs off: int, Players Used: int]
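The cell above reads a local copy of the file. One way to obtain that copy from the Step 2 address is sketched below using only the standard library. The helper name `download_if_missing` and the constant `EURO12_URL` are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Address given in Step 2.
EURO12_URL = (
    "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/"
    "02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv"
)


def download_if_missing(url: str, dest: str) -> Path:
    """Download url to dest unless dest already exists; return the path."""
    path = Path(dest)
    if not path.exists():
        # Create the target directory if needed, then fetch the file.
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(path))
    return path
```

Calling `download_if_missing(EURO12_URL, './Euro_2012_stats_TEAM.csv')` before `spark.read.csv(...)` would make the local path used above available.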

In [14]:
euro12.select(euro12.columns[:5]) \
    .show()

+-------------------+-----+---------------+----------------+-----------------+
|               Team|Goals|Shots on target|Shots off target|Shooting Accuracy|
+-------------------+-----+---------------+----------------+-----------------+
|            Croatia|    4|             13|              12|            51.9%|
|     Czech Republic|    4|             13|              18|            41.9%|
|            Denmark|    4|             10|              10|            50.0%|
|            England|    5|             11|              18|            50.0%|
|             France|    3|             22|              24|            37.9%|
|            Germany|   10|             32|              32|            47.8%|
|             Greece|    5|              8|              18|            30.7%|
|              Italy|    6|             34|              45|            43.0%|
|        Netherlands|    2|             12|              36|            25.0%|
|             Poland|    2|             15|         

### Step 4. Select only the Goals column.

In [16]:
euro12.select('Goals') \
    .show()

+-----+
|Goals|
+-----+
|    4|
|    4|
|    4|
|    5|
|    3|
|   10|
|    5|
|    6|
|    2|
|    2|
|    6|
|    1|
|    5|
|   12|
|    5|
|    2|
+-----+



### Step 5. How many teams participated in Euro 2012?

In [17]:
euro12.count()

16

### Step 6. What is the number of columns in the dataset?

In [19]:
len(euro12.columns)

35

### Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

In [22]:
discipline = euro12.select('Team', 'Yellow Cards', 'Red Cards')
discipline

DataFrame[Team: string, Yellow Cards: int, Red Cards: int]

### Step 8. Sort the teams by Red Cards, then by Yellow Cards

In [25]:
discipline.orderBy('Red Cards', 'Yellow Cards') \
    .show()

+-------------------+------------+---------+
|               Team|Yellow Cards|Red Cards|
+-------------------+------------+---------+
|            Denmark|           4|        0|
|            Germany|           4|        0|
|        Netherlands|           5|        0|
|            Ukraine|           5|        0|
|            England|           5|        0|
|             France|           6|        0|
|             Russia|           6|        0|
|     Czech Republic|           7|        0|
|             Sweden|           7|        0|
|            Croatia|           9|        0|
|              Spain|          11|        0|
|           Portugal|          12|        0|
|              Italy|          16|        0|
|Republic of Ireland|           6|        1|
|             Poland|           7|        1|
|             Greece|           9|        1|
+-------------------+------------+---------+
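`orderBy('Red Cards', 'Yellow Cards')` orders rows by the pair (Red Cards, Yellow Cards), both ascending. The same ordering can be reproduced in plain Python as a sanity check. The rows below are a hand-copied subset of the table above, and the tuple layout is an assumption made for this sketch.

```python
# (Team, Yellow Cards, Red Cards): a subset copied from the output above.
rows = [
    ("Italy", 16, 0),
    ("Greece", 9, 1),
    ("Denmark", 4, 0),
    ("Poland", 7, 1),
    ("Spain", 11, 0),
]

# Sort by Red Cards first, then by Yellow Cards, both ascending.
by_cards = sorted(rows, key=lambda r: (r[2], r[1]))

for team, yellow, red in by_cards:
    print(f"{team:<8} yellow={yellow:>2} red={red}")
```

Teams tied on both keys (for example Netherlands, Ukraine and England, each with 5 yellow and 0 red cards) have no promised relative order in Spark; adding `'Team'` as a third sort key is one way to make the output deterministic.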



### Step 9. Calculate the mean Yellow Cards given per Team

In [29]:
discipline.select(F.avg('Yellow Cards')).show()

+-----------------+
|avg(Yellow Cards)|
+-----------------+
|           7.4375|
+-----------------+
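The value 7.4375 can be checked by hand against the Yellow Cards column shown in the Step 8 output. A small plain-Python sketch, with the 16 values copied from that table:

```python
from statistics import mean

# Yellow Cards for the 16 teams, copied from the Step 8 output.
yellow_cards = [4, 4, 5, 5, 5, 6, 6, 7, 7, 9, 11, 12, 16, 6, 7, 9]

total = sum(yellow_cards)      # sum of all yellow cards
average = mean(yellow_cards)   # total divided by the number of teams

print(total, len(yellow_cards), average)
```

The sum is 119 over 16 teams, and 119 / 16 = 7.4375, which agrees with `F.avg('Yellow Cards')`.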



### Step 10. Filter teams that scored more than 6 goals

In [31]:
euro12.select('Team', 'Goals') \
    .where(F.col('Goals') > 6) \
    .show()

+-------+-----+
|   Team|Goals|
+-------+-----+
|Germany|   10|
|  Spain|   12|
+-------+-----+



### Step 11. Select the teams that start with G

In [33]:
euro12.select('Team') \
    .where(F.col('Team').startswith('G')) \
    .show()

+-------+
|   Team|
+-------+
|Germany|
| Greece|
+-------+



### Step 12. Select the first 7 columns

In [37]:
# pandas equivalent: euro12.iloc[:, :7]
# 'Total shots (inc. Blocked)' contains a dot, which Spark parses as
# nested-field access, so each column name is wrapped in backticks.

euro12.select(*[f'`{c}`' for c in euro12.columns[:7]])

DataFrame[Team: string, Goals: int, Shots on target: int, Shots off target: int, Shooting Accuracy: string, % Goals-to-shots: string, Total shots (inc. Blocked): int]
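Column names that contain a dot, such as `Total shots (inc. Blocked)`, need backtick quoting when passed to `select` as strings, because Spark otherwise treats the dot as nested-field access. A minimal sketch of a quoting helper follows; `quote_col` is a hypothetical name, and the column list is copied from the Step 3 schema.

```python
def quote_col(name: str) -> str:
    """Wrap a column name in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


# First seven column names, copied from the schema shown in Step 3.
first_seven = [
    "Team",
    "Goals",
    "Shots on target",
    "Shots off target",
    "Shooting Accuracy",
    "% Goals-to-shots",
    "Total shots (inc. Blocked)",
]

quoted = [quote_col(c) for c in first_seven]
print(quoted[-1])
```

With the notebook's DataFrame this would be used as `euro12.select(*[quote_col(c) for c in euro12.columns[:7]])`.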


### Step 13. Select all columns except the last 3.

In [38]:
# pandas equivalent: euro12.iloc[:, :-3]
# Backticks are needed because 'Total shots (inc. Blocked)' contains a dot.

euro12.select(*[f'`{c}`' for c in euro12.columns[:-3]])

DataFrame[Team: string, Goals: int, Shots on target: int, Shots off target: int, Shooting Accuracy: string, % Goals-to-shots: string, Total shots (inc. Blocked): int, Hit Woodwork: int, Penalty goals: int, Penalties not scored: int, Headed goals: int, Passes: int, Passes completed: int, Passing Accuracy: string, Touches: int, Crosses: int, Dribbles: int, Corners Taken: int, Tackles: int, Clearances: int, Interceptions: int, Clearances off line: int, Clean Sheets: int, Blocks: int, Goals conceded: int, Saves made: int, Saves-to-shots ratio: string, Fouls Won: int, Fouls Conceded: int, Offsides: int, Yellow Cards: int, Red Cards: int]
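`euro12.columns` is an ordinary Python list, so the negative slice `[:-3]` behaves as it does on any list. One edge case matters if the number of dropped columns is a variable: `[:-0]` is the same as `[:0]` and yields an empty list. A small sketch with a hypothetical helper `drop_last`:

```python
from typing import List


def drop_last(columns: List[str], n: int) -> List[str]:
    """Return columns without the final n entries (n >= 0)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # columns[:-0] would be empty, so slice by an explicit end index instead.
    return columns[: max(len(columns) - n, 0)]


# The last five column names from the Step 3 schema.
tail = ["Yellow Cards", "Red Cards", "Subs on", "Subs off", "Players Used"]

print(drop_last(tail, 3))  # drops 'Subs on', 'Subs off', 'Players Used'
print(drop_last(tail, 0))  # keeps every column
```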


### Step 14. Present only the Shooting Accuracy from England, Italy and Russia

In [40]:
# pandas equivalents:
#   euro12[euro12['Team'].isin(teams)][['Team', 'Shooting Accuracy']]
#   euro12.loc[euro12['Team'].isin(teams), ['Team', 'Shooting Accuracy']]

teams = ['England', 'Italy', 'Russia']

euro12.select('Team','Shooting Accuracy') \
    .where(F.col('Team').isin(teams)) \
    .show()

+-------+-----------------+
|   Team|Shooting Accuracy|
+-------+-----------------+
|England|            50.0%|
|  Italy|            43.0%|
| Russia|            22.5%|
+-------+-----------------+
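`Shooting Accuracy` was inferred as a string column (see the schema in Step 3) because of the trailing `%`, so comparing or sorting it directly would be lexicographic rather than numeric. A minimal pure-Python sketch of the conversion, using the three values shown above; `parse_percent` is a hypothetical helper name.

```python
def parse_percent(text: str) -> float:
    """Convert a string such as '43.0%' to the float 43.0."""
    return float(text.strip().rstrip("%"))


# Values copied from the output above.
accuracy = {"England": "50.0%", "Italy": "43.0%", "Russia": "22.5%"}

as_numbers = {team: parse_percent(value) for team, value in accuracy.items()}
best = max(as_numbers, key=as_numbers.get)

print(as_numbers)
print(best)
```

In Spark, an analogous column expression would be something like `F.regexp_replace('Shooting Accuracy', '%', '').cast('double')`, which avoids a Python UDF.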

