# Feature Engineering


## First Assessment

1. You are testing the null hypothesis for several groups in a given dataset using analysis of variance (ANOVA), and your calculated F-statistic is 12. What information is missing to decide whether or not to reject the null hypothesis?
    - [ ] There is no missing information, any F > 0 indicates the null hypothesis should be rejected.
    - [ ] The between and within sum of squares, and the selected significance level alpha.
    - [ ] There is no missing information, any F > 1 indicates the null hypothesis should be rejected.
    - [x] <span style='background:yellow'>The degrees of freedom of the between and within variabilities, and the selected significance level alpha.</span>


2. You are implementing a nearest neighbor classifier on a dataset whose features need to be scaled. How can you use the MinMaxScaler method to achieve this task?

```python
import sklearn.preprocessing
sklearn.preprocessing.MinMaxScaler().fit_transform(features)
```


3. What is a text corpus?
    - [ ] The body of a text (as opposed to the introduction and conclusion).
    - [ ] A subset of documents resulting from a query in a system.
    - [ ] The structure of a given text.
    - [x] <span style='background:yellow'>The entire set of documents involved in a system.</span>


4. The principal component in the principal component analysis (PCA) procedure has the largest what?
    - [ ] Entropy
    - [ ] Mean
    - [x] Variance
    - [ ] Significance


5. Which of the following algorithms can be used to assess the following relationship: King is to Queen as Husband is to Wife?
    - [ ] Bag-of-n-grams
    - [x] <span style='background:yellow'>Word2Vec</span>
    - [ ] Parts-of-speech
    - [ ] Bag-of-Words


6. What are stop words?
    - [ ] The final word of every sentence.
    - [ ] Very rare words in the text corpus.
    - [ ] The word that separates two topics in the text.
    - [x] <span style='background:yellow'>Very common words in the language (e.g. the, a, is).</span>


7. What do Parts-of-speech (POS) tags identify?
    - [x] <span style='background:yellow'>The role a word has in a particular sentence (e.g. noun, article, conjunction).</span>
    - [ ] The role a word performs in a sentence (e.g. obscure can be a noun and a verb).
    - [ ] The role a sentence has in a text (e.g. topic, support, transition).
    - [ ] The logical propositions given by the text (i.e. P or Q > R ).


8. You have an array of given values: `['#$%', 'ALpd', '123', 89]`. How can you label encode this data using a scikit-learn method?
    - `sklearn.preprocessing.LabelEncoder().fit_transform(['#$%', 'ALpd', '123', 89])`


9. You have the coordinate vector `[4, 5]`. What is the L1 norm of this vector?
    - 9
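
As a quick check of the arithmetic, the L1 norm is the sum of the absolute values of the components. A minimal sketch in plain Python (the helper name `l1_norm` is illustrative, not a library function):

```python
def l1_norm(vector):
    """Return the L1 (Manhattan) norm: the sum of absolute component values."""
    return sum(abs(component) for component in vector)


print(l1_norm([4, 5]))  # 4 + 5 = 9
```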


10. You are using an autoencoder for input x. After a forward pass to compute activations of all the hidden layers and obtain an output x', what is the next step?
    - [ ] Use a softmax layer to perform feature selection.
    - [ ] Use max-pooling to reduce dimensionality.
    - [ ] Compute the sigmoid of x' and subtract x from it.
    - [x] Measure the error between the output x' and the input x (the reconstruction error).


11. When using a bag-of-words (BoW) representation of a document for sentiment analysis, many of the predictions fail to take into account the effect of negations. This results in misclassifications such as, "I'm not very happy about this" being assigned a positive sentiment. What is the most likely explanation for this?
    - [ ] The dataset is clearly mislabeled; BoW representations should work in this scenario.
    - [ ] There are too few samples that include negations in the dataset.
    - [x] BoW representations cannot model the role of the word in sentences.
    - [ ] BoW representations cannot be used effectively for tasks involving classification.
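
The limitation can be seen in a small sketch. Assuming a naive whitespace tokenizer (the helper `bag_of_words` is illustrative), two sentences with opposite sentiment produce identical bags because word order is discarded:

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (order is discarded)."""
    return Counter(text.lower().split())


positive = bag_of_words("I am happy and not sad")
negative = bag_of_words("I am sad and not happy")

# Both sentences contain exactly the same words, so their BoW vectors match
# even though their sentiments are opposite.
print(positive == negative)  # True
```

A bag-of-n-grams representation with n >= 2 would keep pairs such as "not happy" together, which is one common way to recover some of this context.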


12. Given the ngrams(text, n) function below, which statement is the correct option to generate n-grams from a given text?
```python
def ngrams(text, n):
    words = text.split(' ')
    output = []
    output_len = len(words)-n+1
    for i in range(output_len):
        # the statement that correctly completes the loop body:
        output.append(' '.join(words[i:i+n]))
    return output
```
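
For illustration, here is the same function repeated in self-contained form with a usage example. The sample sentence is made up for this sketch:

```python
def ngrams(text, n):
    """Return all contiguous n-word sequences in `text`, joined by spaces."""
    words = text.split(' ')
    output = []
    for i in range(len(words) - n + 1):
        output.append(' '.join(words[i:i + n]))
    return output


print(ngrams("the quick brown fox", 2))
# ['the quick', 'quick brown', 'brown fox']
```

When `n` exceeds the number of words, the loop range is empty and the function returns an empty list.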


13. What is model stacking?
    - [ ] To use several models in parallel and average the outputs as the result.
    - [ ] A training optimization technique that reuses the weights of previously trained models for similar architectures.
    - [x] To use the output of a model as the input of another.
    - [ ] To use several models in parallel and use the maximum output as the result.

**Model stacking** is an efficient ensemble method in which the predictions, generated by using various machine learning algorithms, are used as inputs in a second-layer learning algorithm. This second-layer algorithm is trained to optimally combine the model predictions to form a new set of predictions.
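
One possible way to try this out is scikit-learn's `StackingClassifier`. The sketch below is only an illustration under assumed choices: the synthetic dataset, the two base models, and the logistic-regression second layer are all arbitrary picks, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption: stands in for a real dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First layer: two different base models. Second layer: a logistic regression
# trained on the cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
predictions = stack.predict(X_test)
print(predictions[:10])
```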


14. You have a dataset with unbalanced categorical data, leading to low accuracy. What can you do to improve the accuracy of the model?
    - [ ] Redistribute observations between the training and test data to remove unidentified bias.
    - [ ] Normalize numerical values to fall between -1 and 1, to prevent features with high values from being dominant.
    - [x] <span style='background:yellow'>Resample the data to balance out the categories.</span>
    - [ ] Distribute observations randomly across the dataset.
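
A minimal sketch of one resampling strategy, random oversampling of the minority classes, in plain Python. The function name `oversample` and the toy data are assumptions made for illustration; undersampling the majority class is an equally valid alternative.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(members) for members in by_class.values())
    balanced_rows, balanced_labels = [], []
    for label, members in by_class.items():
        # Draw extra rows with replacement to fill the gap up to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            balanced_rows.append(row)
            balanced_labels.append(label)
    return balanced_rows, balanced_labels


rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["a", "a", "a", "a", "b"]
new_rows, new_labels = oversample(rows, labels)
print(Counter(new_labels))  # both classes now have 4 rows
```

Resampling is normally applied to the training split only, so that the test data still reflects the original class distribution.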


15. You calculate the between-group variability V_b and the within-group variability V_w. What is the next step in a classical analysis of variance (ANOVA) test?
    - [ ] Compute F = V_b - V_w
    - [ ] Compute F = V_b * V_w
    - [ ] Compute F = V_b + V_w
    - [x] Compute F = V_b/V_w
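
The ratio can be sketched for a one-way ANOVA as follows. This assumes V_b and V_w are the mean squares (sums of squares divided by their degrees of freedom, k - 1 and N - k); the helper `f_statistic` and the sample groups are made up for illustration.

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    v_b = ss_between / (k - 1)        # between-group variability, k - 1 degrees of freedom
    v_w = ss_within / (n_total - k)   # within-group variability, N - k degrees of freedom
    return v_b / v_w


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f_statistic(groups))  # approximately 13
```

The resulting F value is then compared against the critical value of the F distribution for those two degrees of freedom at the chosen significance level.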


16. Which of the following statements correctly explains the rationale behind feature selection using a variance threshold?
    - [ ] Features with values close to zero carry little information.
    - [ ] Features with a mean value close to zero carry little information.
    - [ ] Features spanning a short range carry little information.
    - [x] <span style='background:yellow'>Features with little changes in the data carry little information.</span>
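
A small sketch of the idea in plain Python: columns whose variance does not exceed a threshold are dropped. The helper `select_by_variance` is hypothetical; scikit-learn provides a comparable `VarianceThreshold` transformer.

```python
from statistics import pvariance


def select_by_variance(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))
    return [i for i, column in enumerate(columns) if pvariance(column) > threshold]


rows = [
    [1, 0, 10],
    [1, 1, 20],
    [1, 0, 30],
    [1, 1, 40],
]
print(select_by_variance(rows))  # [1, 2]: column 0 is constant and is dropped
```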


17. In a bin counting scheme, what happens if data from a feature doesn't fall into any of the existing bins?
    - [ ] The data is assigned to the bin with the highest number of values.
    - [x] The data is assigned to a garbage bin.
    - [ ] The data is assigned to the bin with the lowest number of values.
    - [ ] The data is mapped to the last bin.


18. Consider the task of counting the number of livestock in aerial images of farmlands. If the machine learning model that detects animals is only capable of dealing with images where the animal is in the center, what is a very important step in the pre-processing stage of the task?
    - [ ] Transform the image so that all animals are centered and in different channels before sending it to the model.
    - [x] <span style='background:yellow'>Assigning bounding boxes to all objects so that the object is centered in its box. Individual boxes should be used as input to the model.</span>
    - [ ] Cut the image in squares of equal size. The input of the model should be the individual squares.
    - [ ] Removing the background of the image, keeping only the pixels with animals in it.
    

## Second Assessment

1. To use an image as input of a neural network, some preprocessing is required in order to make the format compatible with the network. Most architectures require no additional feature extraction. Which of the following statements explains the rationale behind this?
    - [x] The first layers of the neural network learn to extract the features it needs to perform its task.
    - [ ] The non-linear nature of neural networks works better with unperturbed input. The backpropagation process is disrupted by transformations on the input images.
    - [ ] Neural networks adapt themselves to any input, no matter what it is. This is known as reinforcement learning: the neural network corrects itself, reinforcing its ability to deal with the input.
    - [ ] Neural networks prefer to work with raw data. This allows the method to have more information in every step of the training.


2. What is the formula for normalizing a data set to unit length?
    - $x'=x/||x||$
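
A minimal sketch of the formula, assuming $||x||$ denotes the Euclidean (L2) norm (the function name is illustrative):

```python
import math


def normalize_to_unit_length(vector):
    """Divide each component by the Euclidean (L2) norm of the vector."""
    norm = math.sqrt(sum(component ** 2 for component in vector))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [component / norm for component in vector]


print(normalize_to_unit_length([3, 4]))  # [0.6, 0.8]
```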


3. You have an array with both positive and negative values: $[-2, 5, 8, 10, -12]$. What is the value of element -2 when the data is normalized using a min-max scaler?
    - 0.45 (rounded; $(-2 - (-12)) / (10 - (-12)) = 10/22$)
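
The calculation can be reproduced with a short sketch of min-max scaling in plain Python (the helper `min_max_scale` is illustrative and assumes the values are not all equal):

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]


data = [-2, 5, 8, 10, -12]
scaled = min_max_scale(data)
print(round(scaled[0], 2))  # 0.45, i.e. (-2 + 12) / (10 + 12) = 10 / 22
```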


4. You are using k-PCA to classify a dataset with non-linearly separable data points. The dataset is so large that the kernel matrix does not fit into memory. What would you do to make the problem manageable?
    - [ ] Use a Gaussian kernel to reflect local structure as opposed to global structure.
    - [x] <span style='background:yellow'>Perform clustering on the dataset and use the means of the clusters as data points for the kernel matrix.</span>
    - [ ] Project all the data points to a higher dimension before applying kernel PCA.
    - [ ] Perform classical PCA on the data and reduce the dimensionality before using kernel PCA.


5. How would you automate Feature Engineering?
    - [ ] Downloading, merging and filtering data can be automated with common tools. All other tasks have to be performed manually.
    - [ ] Feature Engineering requires a deep understanding of the domain and currently cannot be automated.
    - [x] <span style='background:yellow'>Many tasks like merging, feature extraction, building a feature matrix can be performed using available tools. This allows the automatic creation of features for new data.</span>
    - [ ] Feature Engineering is done only once during model building. There is no need to automate it.


6. Which task is usually associated with Feature Engineering?
    - [x] <span style='background:yellow'>Imputing data</span>
    - [ ] Plotting graphs
    - [ ] Downloading data
    - [ ] Formatting numerical data

In statistics, **imputation** is the process of replacing missing data with substituted values.
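
As an illustration, mean imputation is one of the simplest strategies. The sketch below assumes missing entries are represented as `None` and that at least one value is observed; the helper name is made up.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [value for value in values if value is not None]
    mean = sum(observed) / len(observed)
    return [mean if value is None else value for value in values]


print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```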


7. You have a square 3x3 matrix:
```python
[
[ -3., 5., 15 ], 
[ 0., 6., 14 ], 
[ 6., 3., 11 ]
]
```
How can you quantize the matrix so that its three columns are discretized into {3, 2, 2} bins, respectively?
```python
import sklearn.preprocessing
sklearn.preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2]).fit_transform(matrix)
```


8. What is another name for the test dataset?
    - [ ] Verification dataset
    - [x] Holdout dataset
    - [ ] Unbiased dataset
    - [ ] Cross-examination dataset


9. Which of the following is a manifold learning method?
    - [ ] Independent Component Analysis
    - [ ] Principal Component Analysis
    - [ ] Linear Discriminant Analysis
    - [x] <span style='background:yellow'>t-Distributed Stochastic Neighbor Embedding</span>


10. How does feature scaling improve the Gradient Descent process for a Linear Regression?
    - [x] The descent steps are more direct towards the minimum and it also minimizes the risk of "overshooting" the minimum.
    - [ ] Scaling down all input values reduces the size and number of steps needed to reach the minimum.
    - [ ] Feature scaling does not affect the Gradient Descent process or outcome.
    - [ ] The steps are generally on a larger scale and thus computationally more efficient.


11. What does the chi-squared test evaluate?
    - [x] <span style='background:yellow'>The likelihood that any observed difference between classes can occur at random. </span>
    - [ ] The likelihood that the data can predict the labels.
    - [ ] Whether the data points across classes share a probability density function.
    - [ ] Whether the p-value of the observations has statistical significance.


12. What is the difference between applying one-hot encoding and dummy encoding on an attribute with N classes?
    - [x] <span style='background:yellow'>One-hot encoding converts the attribute into N new attributes, whereas dummy encoding converts it into N-1 new attributes.</span>
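
The difference in column count can be sketched in plain Python. The helpers `one_hot` and `dummy` are illustrative, and dropping the first class in sorted order as the baseline is an arbitrary choice for this example:

```python
def one_hot(values):
    """Encode each value as N indicator columns, one per distinct class."""
    classes = sorted(set(values))
    return [[int(value == cls) for cls in classes] for value in values]


def dummy(values):
    """Encode each value as N-1 indicator columns; the first class is the all-zero baseline."""
    classes = sorted(set(values))[1:]
    return [[int(value == cls) for cls in classes] for value in values]


colors = ["red", "green", "blue"]
print(one_hot(colors))  # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(dummy(colors))    # [[0, 1], [1, 0], [0, 0]]
```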


13. You are producing the word feature set for a standard information retrieval application. You split the text into words and remove capitalization for all common names. What is the next step?
    - [x] <span style='background:yellow'>Filter stop words</span>
    - [ ] Tokenization
    - [ ] Vectorization
    - [ ] Produce the one-hot encoding
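
The sequence described in the question (split, lowercase, then filter stop words) might look like the following sketch. The stop-word set here is a tiny made-up list, not a standard one:

```python
STOP_WORDS = {"the", "a", "is", "of", "and"}  # tiny illustrative list, not a standard one


def word_features(text):
    """Split into words, lowercase them, then drop stop words."""
    words = [word.lower() for word in text.split()]
    return [word for word in words if word not in STOP_WORDS]


print(word_features("The cat is a friend of the dog"))
# ['cat', 'friend', 'dog']
```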


14. You are working in a limited resource environment where you need to encode a categorical feature for further processing. What type of encoder is suitable for this task?
    - [ ] A label encoder, because it will map only existing values of the feature to corresponding continuous values.
    - [x] <span style='background:yellow'>A label encoder, because it will map only existing values of the feature to corresponding discrete values.</span>
    - [ ] A one-hot encoder, because it will create a new feature for every encountered unique class.
    - [ ] A one-hot encoder, because it will diminish the feature size by keeping only one class out of all the correlated classes.


15. What is the correct strategy to use when you have missing data in the target column?
    - [ ] Replace the missing values with the Mean
    - [ ] Replace the missing values using Principal Component Analysis
    - [ ] Replace the missing values using Multivariate Imputation using Chained Equations (MICE)
    - [x] Remove the entire row


16. For which type of data is Principal Component Analysis (PCA) NOT the best option?
    - [ ] When the variance of the data does not follow a normal distribution.
    - [ ] When the variance of the data lies in a non-linear manifold.
    - [ ] When the variance of the data is not proportional to the standard distribution.
    - [x] When the variance of the data is better explained by non-orthogonal components.


17. You have a dataset of flight delays and hourly weather readings which include the following data fields: `AirportCode`, `FlightNumber`, `Airline`, `Date`, `FlightDuration`, `DepartureDelayMinutes`, and `ArrivalDelayMinutes`. What steps would you follow to predict whether or not a given flight will be delayed?
    1. Use feature engineering to convert the `ArrivalDelayMinutes` data field to a boolean value.
    2. Train a Binary Classification model using this boolean value as the target.


18. You are building a tool to detect plagiarism for English essays at a national level. What would you do to shrink the search space automatically?
    - [ ] Use local alignment techniques to match documents.
    - [x] <span style='background:yellow'>Use LSH to group documents which are likely to show high similarity.</span>
    - [ ] Use PCA to reduce the dimensionality of the dataset.
    - [ ] Compute a pairwise Jaccard Similarity of all documents.
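
To make the LSH idea concrete, here is a rough sketch of MinHash signatures with banding, written with only the standard library. All names, the shingle size, and the band and row counts are assumptions chosen for illustration; a production system would tune them and likely use a dedicated library.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(shingle_set, num_hashes=20):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return signature


def candidate_groups(documents, bands=10, rows=2):
    """Bucket documents by signature bands; documents sharing a bucket are candidates."""
    buckets = defaultdict(set)
    for name, text in documents.items():
        signature = minhash_signature(shingles(text), num_hashes=bands * rows)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    return {frozenset(names) for names in buckets.values() if len(names) > 1}


documents = {
    "essay_a": "feature engineering turns raw data into useful model inputs",
    "essay_b": "feature engineering turns raw data into useful model inputs",
    "essay_c": "the weather today is sunny with a light breeze from the west",
}
# essay_a and essay_b are identical, so they land in the same buckets.
print(candidate_groups(documents))
```

Only documents that share at least one bucket need a full pairwise comparison, which is how the search space shrinks. Near-duplicates are likely, though not guaranteed, to collide in some band.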
